Rapid discovery of new Eu 2+ -activated phosphors with a designed luminescence color using a data-driven approach †

For rapid and efficient development of new phosphors, a method that proposes promising candidates is needed so that time-consuming trial-and-error experiments can be focused on them. A data-driven approach to discover new phosphor materials with a designed luminescence color is demonstrated in this paper. To screen compounds for a desirable luminescence color, a machine learning model has been developed for predicting emission peak wavelengths from a dataset composed of 129 Eu 2+ -activated phosphors. General-purpose compositional and structural features are used to represent host compounds of phosphors. Bootstrap aggregation with the gradient boosted regression trees method is adopted to obtain high predictive performance and to avoid overfitting. The predictive performance of the machine learning model is estimated to be 25 nm of mean absolute error (MAE) and 33 nm of root mean squared error (RMSE) by 10-fold cross validation. To discover new green-emitting Eu 2+ -activated phosphors, twenty candidate compounds with predicted emission peak wavelengths of about 500–550 nm have been selected from a materials database, and the candidates have been synthesized and characterized by experiments. Three new Eu 2+ -activated phosphors, Li 2 Ca 4 Si 4 O 13 :Eu 2+ , Na 2 Ca 2 Si 2 O 7 :Eu 2+ , and SrLaGaO 4 :Eu 2+ , successfully show green or blue-green emissions as designed.


Introduction
Phosphor-converted white light-emitting diodes (pc-wLEDs), which are composed of blue or near-ultraviolet LED chips as a primary light source and phosphors as down-conversion luminescent materials, are one of the indispensable lighting technologies today because of their high luminous efficiency, cost effectiveness, environment-friendliness, and spectral design flexibility. 1 For pc-wLED applications, phosphors have various requirements such as strong absorption of the LED light, a suitable emission spectrum, high quantum efficiency, small thermal quenching/degradation, high chemical stability, and small luminance saturation. Ce 3+ and Eu 2+ ions are often selected as activators of the phosphors for pc-wLEDs. These lanthanide ions utilize parity-allowed 4f-5d transitions, which are often characterized by high radiative emission probability, short lifetime, and relatively broad absorption and emission spectra, in contrast to parity-forbidden 4f-4f transitions. 2 Furthermore, because their 5d states are strongly influenced by the host lattices, their luminescence properties can be tuned by variation of the hosts. However, exploring and optimizing new phosphors requires time-intensive trial-and-error experiments. Even though several strategies have been proposed for efficient development of new phosphors, 3,4 an effective method to select candidate compounds for desirable properties is still needed to focus the time-intensive experiments on promising candidates.
The emission spectrum is one of the most important characteristics of phosphors because it determines their luminescence color. The emission spectrum is often characterized by its peak position and full width at half maximum (FWHM). The relationship among host compounds, the absorption spectrum, and the emission spectrum has so far been investigated empirically or semi-empirically for Ce 3+ - and Eu 2+ -activated phosphors. 11 Ab initio multi-configurational quantum chemical calculations have been performed to quantitatively calculate configuration coordinate diagrams and absorption spectra. 12 Constrained DFT calculations have also been conducted to evaluate absorption and emission energies. 13,14 However, these theoretical methods require time-consuming calculations for both the ground and excited states. Because of the high computational cost, high-throughput theoretical calculations to screen candidate compounds are not currently feasible.
Sohn and his coworkers reported the pioneering machine-learning study on the relationship among emission peak wavelength, FWHM, and local environments of substitution sites in host lattices, 5 and recently reported comprehensive machine learning to predict band gap, excitation energy, and emission energy for Eu 2+ -activated phosphors. 6 Nakano et al. reported machine learning to predict emission peak energy from chemical compositions of the host compounds for Eu 2+ -activated phosphors. 7 The reported prediction accuracy is not directly comparable among the theoretical calculations and the machine learning studies because they used different datasets. Nevertheless, the results suggest that the machine learning models 6,7 have prediction accuracy comparable to that of the DFT calculations. 14 Based on the successful machine-learning studies to date, it is expected that new phosphors with desirable luminescence properties will be developed using machine learning. Although several research groups have reported new phosphors found by data-driven approaches, 9 discovery of new phosphors with a designed luminescence color is still a big challenge. In this paper, we report the discovery of three new green or blue-green emitting phosphors, which a machine-learning model has proposed as green-emitting phosphors. First, we developed a machine learning model to predict the emission peak wavelengths of Eu 2+ -activated phosphors from an in-house phosphor dataset. Next, we explored a materials database and collected candidate host compounds predicted to show green emissions by the machine learning model. Then, we synthesized and characterized the candidates, and finally discovered the three new Eu 2+ -activated phosphors, Li

Data collection
Even though phosphors have been intensively investigated so far, there is no readily available dataset of phosphor materials and luminescence properties. Therefore, a dataset of host compounds and emission peak wavelengths of Eu 2+ -activated phosphors was collected from the literature. 1,15 Only host compounds with typical oxidation states and containing Ca, Sr, or Ba elements were selected. These alkaline earth metals are considered as substitution sites for Eu 2+ ions because they have the same valence as Eu 2+ and similar ionic radii. Crystal structures of the hosts were collected from the inorganic crystal structure database (ICSD) 16 and AtomWork-Adv. 17 Some structure data were modified as follows. (1) Structure data with chemical compositions that deviate from the ideal compositions of the hosts, for example those containing Eu 2+ , were corrected to have the ideal compositions of the hosts. (2) Structure data with partially occupied sites and different site occupancies were modified to have high occupancy sites only. Partially occupied sites cause ambiguity in the representation of local environments of the substitution sites. Host compounds with awkward site occupancy, which cannot be simply discretized as described above, were dropped.
Emission peak wavelength is used as a target variable in this study because the emission spectra of phosphors are usually measured and reported in wavelength. The emission peak wavelengths depend on the concentrations of activators and other factors. Because the conditions in the literature are inconsistent, the reported values vary to some extent. If multiple emission peak wavelengths are reported for a single phosphor material and the reported values differ by more than 30 nm, the phosphor is eliminated. In our opinion, a deviation of 10 nm or more in the emission peak wavelength is conceivable due to the different conditions.
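The elimination rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' curation script: the function name `curate_peaks`, the example hosts, and their wavelengths are hypothetical, and averaging the surviving reports into one target value is an assumption, since the text does not say how multiple consistent reports were combined.

```python
from statistics import mean

MAX_SPREAD_NM = 30.0  # elimination threshold stated in the text


def curate_peaks(reports, max_spread_nm=MAX_SPREAD_NM):
    """Return {phosphor: representative peak (nm)} for consistent entries.

    `reports` maps a phosphor label to the list of emission peak
    wavelengths (nm) reported for it. Entries whose reported values differ
    by more than `max_spread_nm` are dropped. Averaging the remaining
    values is an assumption of this sketch.
    """
    curated = {}
    for phosphor, peaks in reports.items():
        if not peaks:
            continue  # nothing reported, nothing to keep
        if max(peaks) - min(peaks) > max_spread_nm:
            continue  # inconsistent literature values: eliminate
        curated[phosphor] = mean(peaks)
    return curated


# Hypothetical literature values, for illustration only.
example_reports = {
    "host A": [520.0, 528.0],  # spread 8 nm  -> kept
    "host B": [455.0, 500.0],  # spread 45 nm -> eliminated
    "host C": [610.0],         # single report -> kept
}
curated = curate_peaks(example_reports)
```

With this rule, a spread of exactly 30 nm is still kept, because the text eliminates only phosphors whose reported values differ by more than 30 nm.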
Finally, a dataset composed of 129 Eu 2+ -activated phosphors was prepared. The distribution and statistics of the emission peak wavelengths are respectively shown in Fig. 1a and Table 1. Constituent elements of the host compounds are summarized in Fig. 1b. Among the constituent elements, sulfur appeared as both a cation (S 6+ ) and an anion (S 2− ). N, O, F, Cl, Br, and I elements were anions, and the other elements were cations.

Host representation
Two sets of features were used to represent host compounds of Eu 2+ -activated phosphors. The first set is a representation of chemical compositions (compositional features, hereafter), and the second set is a representation of crystal structures, particularly local environments of substitution sites for Eu 2+ activators, from both geometrical and chemical aspects (structural features, hereafter).
As the compositional features, general-purpose features 18 were adopted. The general-purpose features are a set of statistics of elemental features that represent various aspects of chemical compositions. Nakano et al. used the same scheme for their machine learning. 7 In this study, 22 elemental features and seven statistics, namely, weighted arithmetic mean, weighted geometric mean, weighted harmonic mean, weighted standard deviation, minimum, maximum, and range, were used. The elemental features and the statistics are respectively listed in Tables S1 and S2 in the ESI.† In addition to the elemental features, oxidation states were considered. As oxidation states take both positive and negative values and satisfy charge neutrality, the weighted arithmetic, geometric, and harmonic means were excluded for them. Instead, the seven statistics of absolute oxidation states were additionally included. As the hosts in this study are all ionic compounds, the statistics of the elemental features and the absolute oxidation states were also evaluated for the cations only and for the anions only. The compositional features consisted of 487 features.
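The seven statistics named above can be illustrated with a short Python sketch. The exact definitions used in the study are given in Table S2 of the ESI, which is not reproduced here, so the formulas below (atomic fractions as weights, population-style weighted standard deviation) are assumptions. The function name `composition_statistics` is hypothetical, and atomic number is used only as an example elemental feature for the host SrLaGaO 4.

```python
import math


def composition_statistics(counts, feature):
    """Seven composition statistics of one elemental feature.

    `counts` maps element symbols to formula-unit counts and `feature`
    maps element symbols to a positive feature value. Weights are atomic
    fractions (an assumption of this sketch).
    """
    total = sum(counts.values())
    w = {el: n / total for el, n in counts.items()}
    x = {el: feature[el] for el in counts}

    arithmetic = sum(w[el] * x[el] for el in counts)
    geometric = math.exp(sum(w[el] * math.log(x[el]) for el in counts))
    harmonic = 1.0 / sum(w[el] / x[el] for el in counts)
    std = math.sqrt(sum(w[el] * (x[el] - arithmetic) ** 2 for el in counts))
    lowest, highest = min(x.values()), max(x.values())

    return {
        "arithmetic_mean": arithmetic,
        "geometric_mean": geometric,
        "harmonic_mean": harmonic,
        "std": std,
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# SrLaGaO4 with atomic number as the example elemental feature.
atomic_number = {"Sr": 38, "La": 57, "Ga": 31, "O": 8}
stats = composition_statistics({"Sr": 1, "La": 1, "Ga": 1, "O": 4}, atomic_number)
```

The cation-only and anion-only variants described in the text would follow by passing only the cations (here Sr, La, Ga) or only the anions (here O) in `counts`.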
To represent the local environments of the substitution sites, Park et al. used geometrical and elemental features of activator-anion and activator-cation polyhedra. 6 This idea was generalized, inspired by the general-purpose compositional features. The structural features used in this study consisted of three groups of features. The first group described the geometrical aspect of the substitution sites. The numbers of neighboring anions and cations, average distances to the neighboring anions and cations, distortion index, 19 and bond valence sum 20 were evaluated for individual Ca, Sr, and Ba sites. The neighboring anions were determined using the CrystalNN method. 21 The neighboring cations were defined as those sharing neighboring anions with the substitution sites. As some of the host compounds used in this study have multiple substitution sites, the average and standard deviation of each feature among the substitution sites were evaluated and used as features of the host structures. The number of symmetrically inequivalent substitution sites was also included. The second group was analogous to the compositional features but specialized for the local environments of the substitution sites. The seven statistics of the 22 elemental features and the absolute oxidation states were calculated for the neighboring anions and the neighboring cations of individual Ca, Sr, and Ba sites. The average and standard deviation among the substitution sites were used as the features of the hosts. Besides the features of the substitution sites, density and number density were added as the third group. The structural features consisted of 659 features.
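The step of reducing per-site geometry to host-level features can be sketched as follows. This is only an illustration under stated assumptions: the distortion index is taken to be the mean relative deviation of bond lengths from their average (the commonly used bond-length distortion index; the source only cites ref. 19 for it), the standard deviation among sites is taken as the population standard deviation, and the neighbor distances are hypothetical numbers rather than output of CrystalNN. The function names are not from the paper.

```python
from statistics import mean, pstdev


def distortion_index(distances):
    """Mean relative deviation of bond lengths from their average.

    Assumed definition: D = (1/n) * sum(|d_i - d_avg|) / d_avg.
    """
    avg = mean(distances)
    return sum(abs(d - avg) for d in distances) / (len(distances) * avg)


def aggregate_sites(site_distances):
    """Average and standard deviation of per-site features over all sites.

    `site_distances` holds one list of site-to-anion distances (in
    angstrom) per symmetrically inequivalent substitution site.
    """
    per_site = [
        {
            "coordination_number": len(d),
            "mean_distance": mean(d),
            "distortion_index": distortion_index(d),
        }
        for d in site_distances
    ]
    features = {"n_sites": len(per_site)}
    for key in per_site[0]:
        values = [site[key] for site in per_site]
        features[key + "_avg"] = mean(values)
        features[key + "_std"] = pstdev(values)  # population std (assumption)
    return features


# Two hypothetical substitution sites: a regular 6-fold site and a
# slightly distorted 8-fold site.
site_1 = [2.4] * 6
site_2 = [2.5, 2.5, 2.7, 2.7, 2.6, 2.6, 2.6, 2.6]
host_features = aggregate_sites([site_1, site_2])
```

A host with a single substitution site gets a standard deviation of zero for every feature, so the same feature vector length is obtained for all hosts.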
The features were evaluated using the Pymatgen package 22 and a customized version of the XenonPy package. 23

Machine learning
The general-purpose features used in this study were systematically calculated to represent various aspects of the host compounds, and thus some of them were redundant or irrelevant to the emission peak wavelength. Therefore, feature selection was adopted before regression. First, features with low variance were dropped, and the remaining features were standardized so that the means were zero and the standard deviations were one. After the standardization, the features were roughly selected in the order of mutual information with the emission peak wavelength. The features were further narrowed down using recursive feature elimination (RFE) based on the importance of each feature obtained by a regression model. Finally, regression was conducted. The ridge, automatic relevance determination (ARD), random forest (RF), gradient boosted regression trees (GB), and bootstrap aggregation (bagging) of GB methods were applied for the regression. The regression method used in RFE was the same as the final regression, except for the bagging of GB regression. For the bagging of GB regression, a single GB model was used in RFE to reduce computation time. The Scikit-learn package 24 was used for the machine learning.
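The sequence of steps described above maps naturally onto a scikit-learn `Pipeline`. The following is a minimal sketch for the bagging-of-GB case, not the authors' pipeline: the actual settings are in Table S3 of the ESI, so the variance threshold, the numbers of selected features, the numbers of estimators, and the synthetic data are all assumptions chosen to keep the example small.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    VarianceThreshold,
    mutual_info_regression,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(n_mutual_info=20, n_rfe=10, n_bags=5, seed=0):
    """Variance filter -> standardize -> mutual information -> RFE -> bagged GB."""
    base_gb = GradientBoostingRegressor(n_estimators=50, random_state=seed)
    return Pipeline([
        ("variance", VarianceThreshold(threshold=0.0)),  # drop constant features
        ("scale", StandardScaler()),
        ("mutual_info", SelectKBest(
            partial(mutual_info_regression, random_state=seed), k=n_mutual_info)),
        # A single GB model drives RFE, as described in the text.
        ("rfe", RFE(GradientBoostingRegressor(random_state=seed),
                    n_features_to_select=n_rfe)),
        ("regressor", BaggingRegressor(base_gb, n_estimators=n_bags,
                                       random_state=seed)),
    ])


def demo(seed=0):
    """Fit the sketch on synthetic data and predict held-out samples."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(80, 40))
    X[:, -1] = 1.0  # a constant column that the variance filter removes
    # Synthetic nonlinear target on a wavelength-like scale (nm).
    y = 500 + 30 * X[:, 0] - 20 * X[:, 1] ** 2 + rng.normal(scale=5, size=80)
    pipeline = build_pipeline(seed=seed)
    pipeline.fit(X[:60], y[:60])
    return pipeline, pipeline.predict(X[60:])


fitted_pipeline, predictions = demo()
```

Keeping the feature selection inside the pipeline means that, under cross validation, the selection is refitted on each training fold and never sees the validation data.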
The predictive performance of the machine learning models was evaluated by 10-fold cross validation in terms of the mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R 2 ). The scores were averaged among the folds. The parameters of the regression models and the numbers of selected features were chosen to minimize the average RMSE for the validation data. The parameter search was performed by Bayesian optimization using the Hyperopt 25 and scikit-optimize 26 packages with 1000 iterations for each method. Default parameters were used for the regression models used in RFE to reduce the computation time for the parameter search. The pipelines of the machine-learning models and the optimized parameters are summarized in Table S3 in the ESI.†
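The three scores and their averaging over folds can be written out explicitly. The sketch below uses the standard textbook definitions of MAE, RMSE, and R 2; the helper name `fold_scores`, the use of the population standard deviation among folds, and the example numbers are assumptions for illustration only.

```python
import math
from statistics import mean, pstdev


def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fold_scores(folds):
    """Average and standard deviation of each score among validation folds.

    `folds` is a list of (y_true, y_pred) pairs, one pair per fold.
    """
    summary = {}
    for name, metric in (("MAE", mae), ("RMSE", rmse), ("R2", r2)):
        per_fold = [metric(y_true, y_pred) for y_true, y_pred in folds]
        summary[name] = (mean(per_fold), pstdev(per_fold))
    return summary


# Hypothetical reported and predicted peak wavelengths (nm) for two folds.
fold_a = ([450, 500, 550, 600], [460, 490, 560, 620])
fold_b = ([480, 520], [480, 520])
summary = fold_scores([fold_a, fold_b])
```

Because the scores are computed per fold and then averaged, the averaged RMSE is generally not the same as the RMSE computed once over all pooled validation predictions.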

Experiments
Candidates of Eu 2+ -activated phosphors proposed by a machine learning model were synthesized and characterized by experiments. The phosphors were synthesized by a solid-state method. The starting materials (oxides or carbonates) of the host compounds were mixed with Eu 2 O 3 . The amount of Eu element was fixed at 2 at% of the substitution sites, namely, Ca, Sr, and Ba, in the hosts. The starting materials were fired in air, and then fired in a reducing atmosphere (in a carbon heater furnace filled with nitrogen). The firing temperatures and time were altered depending on the host compounds.
The products were first characterized using a powder X-ray diffractometer (XRD) (Bruker, D8 ADVANCE, Cu Kα radiation) and a spectrofluorometer (JASCO, FP-8600). The powder XRD analysis indicated that some products were mixtures of the target compounds and impurity phases. As the photoluminescence (PL) spectra of the powder samples are largely influenced by impurity phases with bright luminescence, it was not clear whether the PL spectra of the mixture products were derived from the target compounds or the impurity phases. Therefore, after the first screening using the powder samples, well-crystallized particles were picked up from the products and characterized by single crystal XRD and microspectroscopy following the single-particle diagnosis approach. 4 The single crystal XRD data of the picked particles were collected using a diffractometer (Bruker-AXS, SMART APEX II Ultra) with Mo Kα radiation. The data were integrated and corrected for absorption using SADABS. The crystal structures were solved and refined with SHELX. The PL spectra of the particles were obtained using a spectrometer (Otsuka electronics, MCPD7700) through a microscope (Olympus, BX51M) under 365 nm LED excitation.

Comparison of regression methods
Regression methods are compared in this section. MAE, RMSE, and R 2 for the training and validation data in the cross validation are summarized in Table 2. Fig. 2 illustrates predicted emission peak wavelengths with respect to the reported values in the cross validation. The ridge regression is the baseline model in this study. The R 2 of the ridge regression to the validation data, 0.74, suggests that the prediction accuracy was comparable to the previous studies, 6,7 although the results are not directly comparable due to the use of the different datasets.
To improve the predictive performance, other regression methods were applied. The ARD regression is a Bayesian linear model with an intrinsic feature selection capability, and this method resulted in a slightly higher prediction accuracy to the validation data compared with the ridge regression. The ridge and ARD models showed relatively large fitting errors to the training data. This indicates that the relationship between the general-purpose features used in this study and the emission peak wavelength is basically nonlinear, although the general-purpose features are numerous and diverse. The small differences in the predictive performance scores between the training and validation data of these linear models imply that the obtained predictive performance almost reached the optimum for linear models.
Nonlinear regression methods were applied to obtain a higher predictive performance. The RF model showed slightly smaller MAE but larger RMSE to the validation data than the ARD model. The GB model showed much smaller MAE and RMSE to the validation data than the ARD and RF models. However, the fitting errors of the GB model to the training data were almost zero, which raised a concern about overfitting. To address this concern, the bagging technique was applied to the GB regression. The bagging technique is also used in the RF regression and is expected to suppress the overfitting. The bagging of the GB model showed intermediate predictive performance to the validation data between the GB and RF models. The better predictive performance of the bagging of the GB model compared with the RF model is probably due to the higher predictive capacity of the GB regression as a base learner compared with that of the regression trees in the RF model.
The RF, GB, and bagging of GB models showed large prediction errors for some specific compounds in the validation folds. A plausible cause of these large prediction errors is that the phosphor dataset used in this study is not sufficiently large with respect to the diverse phosphor materials. If a host compound is unique in the dataset and is put in the validation data in a fold of the cross validation, the training data does not contain compounds like the unique host, resulting in a large prediction error. Another possible cause of the large prediction errors is the quality of the reported emission peak wavelengths. Some phosphor materials have a deviation of tens of nm or more in the reported emission peak wavelengths. Phosphors with large deviations have been eliminated from the dataset as mentioned in the Methods section, but the data might not be fully curated yet. Further investigation of the large prediction errors is beyond the scope of this study, whereas obtaining a high-quality dataset that covers diverse materials is a big issue in data-driven materials research.
Emission peak wavelength is used as the target variable in this study, while the energy of the emission peak was used as the target variable in the previous studies. 6,7 Note that, in principle, an intensity correction is required to convert an emission spectrum from wavelength to energy and vice versa, and the correction shifts the peak top. For comparison with previous studies, the emission peak wavelengths were simply converted into energy without such intensity correction, and regression on the converted energy was conducted. The bagging of the GB method was used. The prediction accuracy and the plot of the predicted values with respect to the reported ones are shown in Table S4 and

Test with additional literature data
To develop new phosphor materials, the AtomWork-Adv materials database was explored and candidate host compounds of oxides, nitrides, and oxynitrides composed of main elements and containing Ca, Sr, or Ba elements were collected. Emission peak wavelengths of the collected compounds were predicted using the bagging of GB model that was rebuilt using the whole phosphor dataset with the optimized parameters. Compounds with predicted wavelengths of about 500-550 nm were selected as candidates of green-emitting phosphors. Some of the collected compounds had already been reported as Eu 2+ -activated phosphors, although they were not in the phosphor dataset. Therefore, an additional test of the machine learning model was performed with these 21 additional Eu 2+ -activated phosphors.
The predicted and reported emission peak wavelengths of the additional 21 phosphors are illustrated in Fig. 3, which are overlaid on the cross-validation results (Fig. 2e). MAE and RMSE to the test data were 33 nm and 42 nm, respectively. The distribution of the prediction errors looks comparable with that for the validation data in the cross validation, but the MAE and RMSE were much larger than the values estimated by the cross validation. The test data contained Sr 2 GeO 4 :Eu 2+ , which looked like an outlier. Sr 2 GeO 4 :Eu 2+ showed the largest prediction error: a predicted value of 515 nm versus 620 nm reported in ref. 27. This host compound contains the element Ge, which was not in the phosphor dataset as shown in Fig. 1b. MAE and RMSE to the other 20 test data except Sr 2 GeO 4 :Eu 2+ were respectively 30 nm and 37 nm, which were comparable to the results from the cross validation. These results suggest that it is essential to extend the phosphor dataset to cover the diverse phosphor materials for a higher predictive performance over a wide range of candidate compounds.
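The effect of the single outlier on the test scores can be checked with a short back-of-the-envelope calculation. The sketch below starts from the rounded values quoted above (33 nm MAE and 42 nm RMSE over 21 phosphors, and a 105 nm error for Sr 2 GeO 4 :Eu 2+ ), so it can only reproduce the reported 20-phosphor scores approximately. The function name `scores_without` is hypothetical.

```python
import math


def scores_without(mae_all, rmse_all, n, abs_error):
    """MAE and RMSE after removing one sample with a known absolute error."""
    sum_abs = mae_all * n        # total absolute error over all n samples
    sum_sq = rmse_all ** 2 * n   # total squared error over all n samples
    mae_rest = (sum_abs - abs_error) / (n - 1)
    rmse_rest = math.sqrt((sum_sq - abs_error ** 2) / (n - 1))
    return mae_rest, rmse_rest


outlier_error = abs(515 - 620)  # predicted vs reported peak, in nm
mae_rest, rmse_rest = scores_without(33.0, 42.0, 21, outlier_error)
```

From the rounded inputs this gives roughly 29 nm MAE and 36 nm RMSE, close to the reported 30 nm and 37 nm; the small gap is plausibly due to rounding of the quoted scores. The calculation also shows that one 105 nm error accounts for about a quarter of the total squared error of the 21 test phosphors.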

Exploration of new phosphor materials
As described in the previous section, oxides, nitrides, and oxynitrides composed of main elements and containing Ca, Sr, or Ba elements were collected from the AtomWork-Adv materials database to develop new phosphors. 20 candidate compounds were selected by removing high-pressure phases and selecting compounds with predicted emission peak

Fig. 3 Predicted emission peak wavelengths with respect to reported values for the test data of the additionally collected Eu 2+ -activated phosphors (green) using the bagging of the gradient boosted regression trees method. The plot is overlaid on the cross-validation results (Fig. 2e).

Conclusions
To rapidly discover new Eu 2+ -activated phosphors with a designed luminescence color, a machine learning model to predict emission peak wavelength was developed from the phosphor dataset composed of 129 Eu 2+ -activated phosphors.
The general-purpose compositional and structural features were used to represent host compounds. The bagging technique with the gradient boosted regression trees method was adopted to obtain high predictive performance against the nonlinear relationship between the features and the emission peak wavelength, and to avoid overfitting with the small phosphor dataset. The predictive performance of the built machine learning model was comparable to those in previous studies. 6,7 The results of the cross validation and the additional test suggest that it is essential to extend the phosphor dataset to cover the diverse phosphor materials for a higher predictive performance over a wide range of candidate compounds.
Using the machine learning model, new green-emitting Eu 2+ -activated phosphors were searched for in the AtomWork-Adv materials database. Among twenty candidate compounds predicted to have emission peak wavelengths of about 500-550 nm, three new phosphors, namely, Eu-doped Li showed simultaneous Eu 2+ and Eu 3+ luminescence, and it shows a blue-green emission derived from the Eu 2+ activators. These results clearly demonstrate that the machine learning on the emission peak wavelength is useful for the rapid and efficient development of new Eu 2+ -activated phosphors with a designed luminescence color.

Fig. 1 (a) Histogram of emission peak wavelengths and (b) frequency of constituent elements of Eu 2+ -activated phosphors used in this study. S appears as both a cation (S 6+ ) and an anion (S 2− ). N, O, F, Cl, Br, and I are anions. The other elements are cations.
Fig. S1 in the ESI.† The present errors (0.13 eV MAE, 0.16 eV RMSE) are slightly smaller (better) than those in ref. 7 (0.139 eV MAE, 0.183 eV RMSE), and slightly larger (worse) than that in ref. 6 (0.020 eV 2 MSE corresponding to 0.14 eV RMSE). Only the features derived from the chemical composition were used in ref. 7, whereas features derived from the structure were also considered in ref. 6 and in this study. This probably resulted in the slightly poorer predictive performance in ref. 7. In ref. 6, the data were restricted to phosphors with only a single substitution site and to examples of the critical activator concentrations corresponding to concentrations showing the highest PL intensity. In contrast, some phosphors in the present dataset had multiple substitution sites, and the activator concentrations varied depending on the literature source. The restriction in ref. 6 might have suppressed the data variability and reduced the RMSE, but it also limited the coverage of the machine learning model.
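The wavelength-to-energy conversion behind this comparison can be sketched as follows. The simple peak conversion uses E = hc/λ with hc ≈ 1239.84 eV nm, and the intensity correction mentioned in the text is assumed here to be the usual Jacobian factor, I(E) = I(λ) λ²/(hc). The function names and the 520 nm example wavelength are illustrative and not taken from the paper.

```python
import math

HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV nm


def nm_to_ev(wavelength_nm):
    """Photon energy (eV) for a wavelength (nm), without intensity correction."""
    return HC_EV_NM / wavelength_nm


def ev_to_nm(energy_ev):
    return HC_EV_NM / energy_ev


def error_nm_to_ev(wavelength_nm, error_nm):
    """First-order energy error for a small wavelength error: hc * dλ / λ²."""
    return HC_EV_NM * error_nm / wavelength_nm ** 2


def spectrum_to_energy(wavelengths_nm, intensities):
    """Convert I(λ) to I(E), applying the Jacobian factor λ²/(hc).

    Returns (energy, intensity) pairs sorted by increasing energy.
    """
    pairs = [
        (nm_to_ev(wl), inten * wl ** 2 / HC_EV_NM)
        for wl, inten in zip(wavelengths_nm, intensities)
    ]
    return sorted(pairs)


peak_ev = nm_to_ev(520.0)              # a green peak at 520 nm, about 2.38 eV
err_ev = error_nm_to_ev(520.0, 25.0)   # 25 nm near 520 nm, about 0.11 eV
rmse_from_mse = math.sqrt(0.020)       # the MSE-to-RMSE step quoted above
```

Because the same wavelength error corresponds to a larger energy error at shorter wavelengths, scores in nm and in eV cannot be converted into each other with a single factor, which is one reason the regression was rerun on the converted energies for the comparison.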
2 Ca 4 Si 4 O 13 :Eu 2+ , Na 2 Ca 2 Si 2 O 7 :Eu 2+ , and SrLaGaO 4 :Eu 2+ .The results clearly demonstrate the power of the machine learning on the emission peak wavelength for rapid and efficient development of new phosphors with a designed luminescence color.

Table 1
Statistics of emission peak wavelengths of Eu 2+ -activated phosphors used in this study

Table 2
Mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R 2 ) of the machine learning models for the training and validation data in the cross validation. The scores were averaged among the folds of the cross validation. Standard deviations among the folds are shown in parentheses.

Table 3
Compositions and space groups of candidate compounds, predicted emission peak wavelengths, and summary of experimental results. Multiple lines for a single composition denote that the candidate composition has polytypes. The space groups and predictions for the polytypes of the synthesized products are underlined.

Open Access Article. Published on 29 November 2022. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

Eu 3+ in the host lattices is attributed to the redox potential of the substituted Eu ions and the annealing conditions, whereas the annealing conditions were limited depending on the host compounds to prevent them from melting or decomposing. Even if Eu 2+ is stable in the hosts, the luminescence may be quenched if the energy levels of the Eu 2+ excited states overlap or are close to the conduction bands of the hosts. These are likely the reasons why many candidates have not exhibited Eu 2+ luminescence. At this moment, it is hard to predict the valence and energy level of the substituted Eu ion in the host and to predict appropriate synthesis conditions to obtain Eu 2+ , prior to synthesis. Prediction of these factors is also important for efficient development of new Eu 2+ -activated phosphors and is a future task.