Molecular screening for solid – solid phase transition by machine learning

: Solid – solid phase transition is a mechanism manifesting and regulating functionalities in molecular crystals. The phase transitions are generally found by chance empirically, and screening molecules theoretically to predict their existence is challenging. In this study, we screened for the possibility of solid – solid phase transitions by positive-unlabeled learning using molecular descriptors and verified the inference by finding new substances exhibiting solid – solid phase transitions in the solid state. We also found that the molecular structure of a substance is weakly related to the transition temperature but not the transition enthalpy. The findings of this study are useful for designing functional molecular crystals.


Introduction
The functionalities of molecular crystals can change in the solid-solid phase transition, dened as the reversal point of the minimum Gibbs free energy between at least two solid states. 1 A solid-solid phase transition sometimes causes a signicant and dynamic change 2 in a property.For example, birefringence can change in a solid-solid phase transition due to the accompanying change in the electronic states of the crystal. 3[9][10] Despite the importance of solid-solid phase transitions, predicting their occurrence before actual experiments is currently difficult.3][14] This is because it is computationally costly to accurately calculate the Gibbs free energy of (assumed) solid states.Therefore, theoretical calculations cannot uncover the occurrence of solid-solid phase transitions in molecular crystals before actual experiments; they are found by chance aer many trial-and-error experiments.
An inductive approach may solve this situation.The method in question, known as materials informatics, has been mainly applied to inorganic and polymeric materials 15,16 and, recently, to molecular crystals as the next target. 17,18This motivated us to apply it to the search for a hidden trend of solid-solid phase transitions in molecular crystals (Fig. 1).We reduced the problem to using molecular structures without considering their crystal structures.Although this simplication introduces the limitation that we cannot incorporate the effect of intermolecular interactions, it has advantages for designing new molecules if we discover a relationship between a solid-solid phase transition and molecular structure.
In this work, we screened for the possibility of solid-solid phase transition using positive-unlabeled (PU) learning with molecular descriptors and found substances that exhibited solid-solid phase transitions in their crystalline states (Fig. 1).We accomplished this by constructing a positive dataset of thermally induced solid-solid phase transitions of molecular crystals, manually curated from published studies.The reason we focused on the thermally induced solid-solid transition was the number of available reports and the practical applications.Classication models were trained and then compared, and the best classier suggested molecules that potentially exhibit solid-solid phase transitions.Among them, we found solidsolid phase transitions by literature search and our experiments.We also performed regression analysis and found that transition temperature was weakly related to molecular structure.These ndings should be useful in designing molecular crystals that exhibit solid-solid phase transitions.

Dataset preparation
The information on the phase transition of molecular crystals was collected based on the criteria that the phase transition has been conrmed by thermal analysis of differential scanning calorimetry (DSC) and/or X-ray crystallography.Here, the phase transition includes reversible and irreversible crystal-to-crystal phase transitions, and we did not care whether it was in a single-crystal-to-single-crystal manner.We did not focus on cryogenic temperatures and collected data at a temperature above 120 K under atmospheric pressure.We also added the condition that permitted atoms are H, B, C, N, O, F, Si, P, S, Cl, Br, and I atom species.A total of 297 datasetswere extracted from 91 papers, and the number of unique molecules was 88.The molecular structure corresponding to each dataset was downloaded from the Cambridge Crystal Database Centre (CCDC) in simplied molecular input line entry system (SMILES) format.The numeric information on phase transition, temperatures (T endo and T exo ), and enthalpies (DH endo and DH exo ) at the endothermic and exothermic phase transitions, was also collected.When they were written as a specic value in the papers, the values were extracted.Instead, when they were written as a range in the papers, the average values were calculated.When they were not written as a value but shown in a gure in the papers, we read the values from the gures.We also collected crystal structure information: CCDC numbers and phase names, for future analysis.
For molecular crystals that were not conrmed to show phase transitions, we searched the data in the CSD using the following conditions: no report on phase transition, R-factor # 0.05, only organic, 3D coordinates determined, no disorder, not polymeric, and not the results of the powder study.We also added the condition that permitted atoms are H, B, C, N, O, F, Si, P, S, Cl, Br, and I atom species.This search was coded using the CSD Python API (v.3.0.14).The search raised 199 987 datasetsfrom the CSD (v.5.42).Then, the duplicate of SMILES was deleted within the unlabeled dataset and between the positive dataset and the unlabeled dataset.Finally, we obtained unlabeled data of 185 037 unique SMILESs.

Machine learning implementation
Molecular structures represented as SMILESs were converted into vectors.][21][22][23][24] For the classication task, we implemented positiveunlabeled (PU) and binary (BC) classications.In both cases, we used random forest (RF), support vector machine (SVM), neural network (NN), and gradient boosting decision tree (GBDT) as prediction models.PU learning was implemented by the weighted Elkanoto method. 25The positive and unlabeled datasets were used for both PU and BC tasks.
Classication models were evaluated based on the product of true positive rate (TPR) and selection effect (SE) on the average of 10-fold cross-validation (CV).The TPR is dened as the proportion of the number of predicted positives in positive data (n pp ) to the number of positive data (n p ): We dened the SE as the multiplier of the number of unlabeled data (n u ) to the number of predicted positives in unlabeled data (n up ) for the selection purpose: Here, n p and n u were 88 and 185 037, respectively.The product of TPR and SE is represented as When n p and n u are xed, this product depends on n pp and n up .The better the performance of the model, the higher the n pp should be, and the more useful the model as a suggester, the lower the n up should be.The model with a higher value of the product is expected to perform molecular screening.The rate of predicted positives in the unlabeled dataset, i.e., the reciprocal of the SE, was not used because the higher metric value (max. 1) does not mean better screening.This is why we used the product of TPR and SE as the score function even though the range of the metric was large.Hyperparameters of the prediction models are summarized in Table S1.† For regression, we used the NN, RF, and transfer learning NN (TL-NN) as prediction models.A 5-fold CV was performed ve times because the number of positive datasetswas small, and the results were inuenced by the data split.The mean absolute error (MAE) was calculated on the average of ve 5-fold CVs.Hyperparameters of the prediction models are summarized in Table S1.† In the TL-NN, the scratch model was trained on a larger dataset of melting points (n = 22 404) 26 through hyperparameter optimization, and then transferred to learn transition temperature and enthalpy.Fine-tuning was performed by increasing the number of trainable layers from the nearest to the output.

Material preparation and characterization
The compound 1,4-bis(3,5-di-tert-butyl-2hydroxybenzylideneaminomethyl)benzene (the crystal of OCA-PAK 27 in the Results section) was synthesized by mixing 3,5-ditert-butylsalicaldehyde and p-xylylenediamine in a molar ratio of 2 : 1 in 2-propanol and by heating for 1 h at 423 K using a microwave.Differential scanning calorimetry was performed using a DSC 8500 (PerkinElmer) in the temperature range of 223-523 K at a speed of 10 K min −1 .Powder X-ray diffraction analysis was performed using a Rigaku Ultima III diffractometer, equipped with monochromatic Cu Ka irradiation (l = 1.54187Å) at 40 kV and 40 mA.The solid appearance was observed using an optical microscope equipped with a camera (WRAYCAM-NF300, Wraymer).

Results and discussion
In the collected dataset, the total number of solid-solid phase transitions was n = 297, and the unique SMILES was n = 88.The molecular structures were well diverse (Fig. S1 †).The transition temperatures and enthalpies of the collected positive data are also summarized in Fig. S2.† The unlabeled data, meaning the solid-solid phase transition in the molecular crystal not recognized in CSD, were collected from the CSD by ltering several conditions (see the Method section).The CSD search resulted in 185 037 unique SMILESs as the unlabeled dataset.
Positive and unlabeled datasets were used for positiveunlabeled (PU) and binary (BC) classications.BC is a task commonly used for determining a discriminant boundary between positive and negative data.However, this problem setting should not be appropriate for the current case because we do not know all the true negatives of the solid-solid phase transition.We cannot obtain the thermal analysis results of all molecular crystals in CSD.In this case, PU classication is a more reasonable setting to nd a discriminant boundary between positive and unlabeled data.We can determine the possibility of solid-solid phase transition in unlabeled data by using the PU setting to predict the synthesizability of materials. 28,29irst, we compared TPRs between PU and BC tasks to rationalize the PU setting (Table S2 †).All combinations solved as PU yielded a much higher TPR than BC.This result indicated that the PU task is more reasonable in this work and that the discriminant boundary in BC failed to predict true positives due to improper problem setting and imbalanced size of positive and unlabeled data.
Among the PU results, Avalon-SVM was recognized as the best classier and suggester based on the highest value of the product of TPR and SE (Table 1).There were some reasons why other models were worse than Avalon-SVM.For example, ECFP-SVM with the second highest metrics had a lower TPR, meaning less model validity (Table S3 †).Mordred-GBDT and RDKit-GBDT showed higher TPRs, but the products with SEs were much lower due to low SEs (Table S3 †).A low SE means that the model predicted most unlabeled molecules to be positive.In such a case, it is difficult for us to select candidate molecules to be investigated next in detail, and the model should not work as a suggester.Therefore, the Avalon-SVM was used for molecular selection.Here, we did not interpret the model due to the interpretability difficulties of Avalon and used it for the screening purpose.
The suggester must nd molecules likely to exhibit solidsolid phase transitions.The unlabeled data were input into the trained 10 models obtained by 10-fold CV, and the number of times they were predicted to be positive divided by 10 was used as the probability of the solid-solid phase transition.Therefore, probabilities are obtained discretely such as 0.1, and 0.2.Most of the unlabeled data were predicted to have p = 0.1, while 113 substances resulted in higher probabilities of p $ 0.2 (Table 2).Therefore, the 113 substances were checked in detail to see if any phase transitions were reported in the literature that were not recognized in the CSD.1][32] For example, the crystal of KUDDUK02 (CSD code) has been reported to transform from a into b irreversibly at 342.9 K upon heating. 31In addition, we experimentally prepared the crystal reported as OCAPAK 27 and found the irreversible crystal-to-crystal phase transition upon 450 K (Fig. S3 †).This phase transition has not been reported anywhere, and the novel solid phase transition was found owing to the molecular screening.Furthermore, although the solid phase transition of the crystal of AREDIN has not been described in the literature, 30 the DSC curve of the compound displayed a small endothermic peak upon heating without weight loss before decomposition, suggesting a potential phase transition.
For molecules with p = 0.2, we identied 6 substances as positive among 99 suggested molecules (Fig. S4 †).We also identied 10 substances as being negative at least in the temperature range of DSC measurement.Determining whether positive or negative required the description and/or gure of DSC measurements.The reason why most substances are still unlabeled is that new crystal structures are oen reported in papers focusing on organic synthesis.In such cases, the conduction of DSC measurement is rare, and a gap in the available amount of data between the crystal structure and thermal measurement is generated.This situation should also support the rationality of PU learning.
These above results showed that at least 3/14 (21.4%) compounds with p $ 0.3 and 9/113 (8.0%) compounds with p $ 0.2 were positive.These positive rates are higher than the positive rate used in model training (88/(185 037 + 88) = 0.05%).Moreover, the CSD contained 532 unique molecules assigned with the word "phase transition", and the occurrence of phase transition in the CSD is 0.29% (=532/(185 037 + 532)), which is much lower than the occurrence from suggested substances.This insists that potential solid phase transitions are hidden even in known crystals, and the machine learning model constructed by PU learning provides us guidance for molecular selection.
Table 2 The distribution of the number of substances in the unlabeled dataset and the predicted probability While the model worked well as a suggester as mentioned above, there is a limitation to the adaptability of the model.Because the discriminant boundary in PU learning is affected strongly by the positive data, the suggested substances among the unlabeled molecules should have similar features to positive ones.Data points of suggested molecules with p $ 0.2 are located near those of positive data as evidenced by the 2D visualization of the Avalon descriptor using t-distributed stochastic neighbor embedding (t-SNE), which is a typical method of manifold (Fig. 3).This result insists that the suggestion is limited to the known positive data.We calculated average distances between known positives and predicted positives and between known positives and the unlabeled data.The former was 1.68, and the latter was 4.71 in the embedding space, showing quantitatively that predicted positives are closer to known positives than other unlabeled data.From a different perspective, the addition of positive data will broaden the variety of suggested molecules.There is also a limitation in that the difference between polymorphs cannot be distinguished because molecular descriptors of the same molecule are encoded into the same vector.The incorporation of crystal structure information will improve model accuracy, and this kind of model extension should be tackled in the future.
Next, we performed the regression of transition temperature and enthalpy.The motivation was to determine whether transition temperature and enthalpy are related to the molecular structure.If this regression captures some hidden relationship, we can interpret which molecular substructure inuences transition temperature and enthalpy and design molecules potentially expressing the solid-solid phase transition at an expected temperature.The positive dataset we manually collected made this analysis possible.In regression, target properties were endothermic and exothermic temperatures (T endo and T exo ) and the corresponding transition enthalpies (DH endo and DH exo ).When multiple solid-solid phase transitions corresponded to a substance, the maximum and minimum values were used in independent regressions (denoted as, for example, T endo(max) and T endo(min) ).
We show the regression results using the Mordred descriptor, which was better than other molecular descriptors in regression (Table S4 †).For T endo(max) , the mean model resulted in a mean absolute error (MAE) = 75.4(K), which is the criterion for no relationship between T endo(max) and the molecular structure (Table 3).The NN and the transfer NN (TL-NN) models yielded worse metrics than the mean model (Table S4 †), whereas RF outperformed the mean model (Table 3).The same trend, RF better than the mean model, was also observed for T endo(min) , T exo(max) , and T exo(min) (Table S4 †).These results suggested that Mordred-RF captured a hidden weak trend between transition temperature and molecular structure.The scatter plot of experimental and predicted values also supports this result because orange points are distributed almost along the reference line (Fig. 4).On the other hand, there was no clear trend for the regression of DH endo and DH exo (Tables 3 and S4 †).Although enthalpy and entropy are directly related to transition temperature in thermodynamic physics, we did not obtain the relationship between transition enthalpy and molecular structure.This probably results from the relatively larger deviation of enthalpy and the smaller number (n ∼ 40) of datasets compared with that of the transition temperature (n ∼ 80).
To further validate the regression results of the transition temperature, the generalization ability of the regression model was checked using positive data found in the unlabeled dataset.Among 9 substances found by molecular screening, 8 datasets were available for the inference of T endo(max) .A comparison of experimental and predicted values shows that 6 predicted values were well matched with the experimental values, although 2 predicted values resulted in larger errors (Fig. 4).
Even though this inference afforded MAE = 57.1 (K), which is in good agreement with the regression result in Table 3, this validates that transition temperature is weakly related to the molecular structure.
To interpret which molecular substructure inuences the transition temperature, we employed two different approaches for model interpretation.One is the feature importance obtained from the RF model, and important features can be identied by their magnitude in reducing the MAE.The other method is Shapley additive explanations (SHAP). 33This method calculates the SHAP value for each feature of each dataset, and the distribution of SHAP values for each feature affords which molecular feature has a positive or negative effect on the target variable.
First, we show the distribution of SHAP values for T endo(max) , T endo(min) , T endo(max) , and T endo(min) (Fig. 5a-d).Here, the top 10 features among 946 features aer sorting by the averaged SHAP value are shown for each target variable.The commonly ranked feature was VSA_Estate4, which is dened as the sum of the electrotopological state values of atoms in the molecule with the van der Waals surface area between 5.41 and 5.74. 24In all cases, low feature values (shown as blue points) of VSA_EState4 tended to distribute in the negative region of SHAP values and high feature values (red points) tended to distribute in the positive region of SHAP values.This result suggests that higher VSA_EState4 tends to increase transition temperature.This interpretation can be rationalized by the scatter plot of VSA_EState4 and T endo(max) (Fig. 5e).Although there is an outlier, a roughly positive correlation between VSA_EState4 and T endo(max) was observed.The tendency is also applied to positive data found in the unlabeled dataset (Fig. 5e).VSA_EState4 was also ranked in the top 10 based on the feature importance of RF (Table S5 †), and thus this feature should have the largest inuence on the transition temperature.
There are 4 other features common for T endo(max) and T endo(min) and 2 features common for T exo(max) and T exo(min) based on SHAP values (Fig. 5a-d).Three out of 4 features common to T endo(max) and T endo(min) were also ranked in the top 10 based on the feature importance of RF (Table S5 †).Thus, ATSC2d, AATS0p, and ATSC3v should affect the temperature of endothermic transition, and their effects should be positive based on SHAP values (Fig. 5a and b).In the same way, 2 features common to T exo(max) and T exo(min) raised by SHAP values were also ranked in the top 10 based on the feature importance of RF (Table S5 †).ATSC1are and Xc-3d should affect the temperature of exothermic transition, and their effects should be positive and negative, respectively, based on SHAP values (Fig. 5c and d).Although intuitive chemical understanding is difficult for these important features, we can calculate each feature value of a molecule and obtain prior knowledge of whether the phase transition temperature is likely to be high or low.

Conclusions
In summary, we have successfully screened molecules for solidsolid phase transitions aided by PU learning and veried it by nding solid phase transitions of suggested substances.The positive rate of suggested substances was 8.0%, which is much higher than the rate used for model training and the rate in the database.This result validated the effectiveness of the current workow, although there is a limitation in that the suggestions are around known positive data.We also found a hidden relationship between the molecular structure and transition temperature by regression.A feature, VSA_EState4, was raised as a commonly important feature for the temperatures of endothermic and exothermic transitions.The effect should be a positive relationship, and some other features were also found.Although this analysis neglected intermolecular interactions and 3-dimensional conformation, the obtained insight enables us to efficiently nd molecules manifesting a solidsolid phase transition in the crystal.This work will aid in the design of functional molecular crystals for the future applications of organic optoelectronics and actuators.

Fig. 1
Fig. 1 Prediction of the solid-solid phase transition using a molecular structure in a molecular crystal composed of a compound with CSD codes URUBOA05, 06.Disordered molecules with minor occupancy were omitted for clarity.The bold arrow shows machine learning workflow: dataset construction of positive and unlabeled data, prediction task of positive-unlabeled classification which outputs the possibility of a solidsolid phase transition, and regression of transition temperature and enthalpy.

Fig. 2
Fig.2Suggested substances with p $ 0.2 by PU prediction.Unidentified means we did not find the DSC result of the crystal and we cannot conclude whether it is positive or negative.Positive means we confirmed the solid-solid phase transition of the crystal according to the literature and experiments.Potential means that the solid-solid phase transition was not reported in the literature, but the thermal profile of DSC showed peak-like behavior before reaching the melting point.CSD codes are also supplied to identify crystal data.Molecules with p = 0.2 are omitted due to the limitation of space and are supplied in the ESI.†

Fig. 3
Fig. 3 Two-dimensional visualization of Avalon vectors embedded by t-SNE.Red and orange points represent 88 positive datasetsand 113 suggested molecules, respectively.Sky-blue points represent 500 unlabeled datasets, randomly sampled for clarity.

Fig. 4
Fig. 4 Scatter plot of experimental and predicted values of T endo(max) .The dashed black line represents the reference line when predicted values are perfectly matched with experimental values.Molecular structures of two outliers are shown.

Table 1
Comparison of the evaluation criterion, the product of TPR and SE, obtained by PU learning a The metrics of RDKit-SVM was not obtained because n up was zero.

Table 3
Regression performance using the Mordred descriptor