Open Access Article
Yoshiaki Uchida†*a, Shizuo Kaji†b and Naoto Nakano†c
a Graduate School of Engineering Science, The University of Osaka, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan. E-mail: y.uchida.es@osaka-u.ac.jp
b Graduate School of Science, Kyoto University, Kitashirakawa Oiwake-cho, Sakyo-ku, Kyoto, Japan
c School of Interdisciplinary Mathematical Sciences, Meiji University, 4-21-1 Nakano, Nakano-ku, Tokyo 164-8525, Japan
First published on 4th May 2026
Experimental data often contain anomalies, which can be errors or previously unrecognised knowledge gaps. While errors undermine the reliability of reported findings, unknown gaps can sometimes point to opportunities for discoveries. Machine learning (ML) techniques offer a promising means of identifying such anomalies. In this study, we propose a human-in-the-loop approach that integrates domain expertise and an ML model trained on a comprehensive database of phase transition behaviours of liquid crystalline (LC) materials (LiqCryst 5.2) to scrutinise data integrity. The ML model uncovered multiple anomalies in reported chemical data on LC phase transition behaviours, which were subsequently re-examined by human experts to determine whether they were due to errors. Our results demonstrate that the ML model can effectively detect inconsistencies even within a large-scale database widely regarded as an industry standard. At the same time, anomalies that do not originate from errors may highlight unexplored phenomena and thereby stimulate future discoveries. The proposed methodology for systematically reassessing reported chemical data has the potential to be applied broadly across different materials systems and scientific domains.
Error detection is a classical yet enduring problem across scientific disciplines.7 In recent years, machine learning (ML) approaches have proven effective for identifying inconsistencies in large, human-curated datasets, with notable examples including applications to Wikipedia and other knowledge bases.8 In materials science, ML techniques have advanced rapidly, enabling highly accurate prediction models trained on large-scale datasets.9,10 These models implicitly encode chemical and physical information present in the data, and can therefore be used to perform meta-analysis to detect anomalous patterns and inconsistencies.5,6 However, ML techniques alone cannot fully resolve the nature of detected anomalies, as they cannot distinguish between erroneous data and genuinely unknown phenomena. Interpreting such discrepancies requires domain-specific knowledge and contextual understanding of how data are generated and reported. Incorporating human expertise through a human-in-the-loop framework allows these anomalies to be examined in context, providing deeper insight into their origins, as shown in Fig. 1. In this work, we demonstrate such a framework for systematic error detection in materials data, illustrated through several representative case studies.
ML models for liquid crystalline (LC) systems have been developed to predict phase transition behaviour.10–12 However, previous studies have predominantly relied on carefully curated, small-scale datasets comprising at most a few thousand molecules, where obvious defects are minimised. While such datasets are suitable for model benchmarking, they do not reflect the heterogeneity and noise inherent in real-world data. Recent studies have begun to extend ML-based research on LC materials beyond small, carefully curated datasets, including large-scale analyses and experimentally validated discovery pipelines.13–15 In this context, our focus is complementary: model-guided detection of database inconsistencies followed by expert re-examination. A key feature of this study is the use of a large-scale database containing over 100 000 substances,16 which inevitably includes inhomogeneities and various types of anomalies. By training an ML model on this dataset, we construct a predictive model that generalises across a broad range of LC materials while maintaining high accuracy. Leveraging this model, we perform systematic anomaly detection within the database, demonstrating that large-scale, heterogeneous data can be effectively utilised not only for prediction but also for identifying inconsistencies.
The curated dataset comprises 43 889 molecules. Model performance was evaluated using 5-fold cross-validation to ensure robustness against dataset variability; in each fold, approximately 80% of the data (ca. 35 111 molecules) were used for training and 20% (ca. 8778 molecules) for testing. All scores were computed from out-of-fold predictions obtained in the 5-fold cross-validation. Thus, each molecule was evaluated by a model that was not trained on that molecule.
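The out-of-fold scheme can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in the repository: the function names, the random shuffling, the seed, and the `fit`/`predict` callbacks are all assumptions introduced here.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle indices once and split them into n_folds nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # the first `extra` folds get one more sample
        folds.append(indices[start:start + size])
        start += size
    return folds

def out_of_fold_predictions(n_samples, fit, predict, n_folds=5, seed=0):
    """Return one prediction per sample, each made by a model that never saw that sample."""
    oof = [None] * n_samples
    for test_idx in k_fold_indices(n_samples, n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        model = fit(train_idx)  # hypothetical training callback
        for i, y in zip(test_idx, predict(model, test_idx)):  # hypothetical inference callback
            oof[i] = y
    return oof

# Fold sizes for a dataset of 43 889 molecules.
fold_sizes = sorted(len(fold) for fold in k_fold_indices(43889))
```

With 43 889 molecules, four test folds hold 8778 molecules and one holds 8777, so each training split has about 35 111 molecules, in line with the figures quoted above.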
The model is based on a graph neural network (GNN), which is well-suited for molecular representation learning due to its ability to capture relational structure beyond handcrafted descriptors. Our implementation is publicly available (https://github.com/shizuo-kaji/LC_QSPR). Specifically, we developed an ML model that predicts both the existence and the temperature ranges of LC phases (Fig. 2). The model achieves classification accuracies exceeding 92%, outperforming previously reported approaches for coarse phase classification (typically 80–90%), as summarised in Table 1.11,12 Furthermore, the prediction error for transition temperatures is within approximately 5%, despite the substantial variability in experimental conditions—such as differences in measurement equipment, sample purity, and heating rates—which can each introduce errors of several degrees Celsius. This level of predictive accuracy enables the reliable identification of anomalous entries in the database.
| Phases | NVᵃ | Accuracy [%] | Recall | Precision | F1 score | MAEᵇ (T+) [°C] | Stdᶜ (T+) [°C] | MAEᵇ (T−) [°C] | Stdᶜ (T−) [°C] |
|---|---|---|---|---|---|---|---|---|---|
| N, N* | 20662 | 92.4 | 0.92 | 0.92 | 0.92 | 7.39 | 11.4 | 11.0 | 12.5 |
| SmA | 9531 | 93.6 | 0.85 | 0.86 | 0.85 | 8.40 | 12.3 | 10.1 | 12.3 |
| SmB | 1710 | 97.9 | 0.69 | 0.76 | 0.72 | 10.7 | 12.8 | 13.3 | 15.3 |
| SmC, SmC* | 6444 | 96.1 | 0.87 | 0.87 | 0.87 | 8.00 | 11.0 | 8.77 | 10.7 |
| SmF, SmF* | 307 | 99.7 | 0.72 | 0.81 | 0.76 | 6.20 | 8.64 | 9.11 | 11.5 |
| SmG | 328 | 99.5 | 0.63 | 0.72 | 0.67 | 8.54 | 11.9 | 10.1 | 10.9 |
| SmH | 52 | 99.9 | 0.56 | 0.62 | 0.59 | 13.1 | 19.0 | 14.0 | 13.9 |
| SmI, SmI* | 274 | 99.7 | 0.75 | 0.77 | 0.76 | 6.13 | 16.3 | 10.1 | 18.0 |
| Colh | 341 | 99.7 | 0.78 | 0.78 | 0.78 | 15.7 | 19.1 | 17.6 | 20.0 |

ᵃ The numbers of molecules showing the specific phases. ᵇ Mean absolute error. ᶜ Standard deviation.
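When reading the table, note that accuracy and F1 score behave very differently for rare phases. The short sketch below recomputes F1 from the tabulated precision and recall. It also estimates the accuracy that a trivial "always absent" classifier would reach, under the assumption (ours, for illustration) that accuracy is computed over all 43 889 molecules.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def always_absent_accuracy(n_positive, n_total):
    """Accuracy (in %) of a classifier that never predicts the phase."""
    return 100.0 * (1.0 - n_positive / n_total)

N_TOTAL = 43889  # curated dataset size

# SmH row: recall 0.56, precision 0.62, 52 positive molecules.
smh_f1 = f1_score(precision=0.62, recall=0.56)
smh_baseline = always_absent_accuracy(52, N_TOTAL)

# N, N* row: 20 662 positive molecules.
n_baseline = always_absent_accuracy(20662, N_TOTAL)
```

For SmH the trivial baseline already reaches about 99.9% accuracy, so recall, precision, and F1 are the more informative columns for the rare smectic families. For N/N* the same baseline would reach only about 53%.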
Two task-specific fully connected heads are used: one for phase classification and one for temperature regression. The model is trained using a combined loss function consisting of focal loss for classification19 and L1 loss for regression. The focal loss addresses the severe class imbalance caused by rare LC phases, which would otherwise bias the model towards dominant phases under standard cross-entropy. For regression, L1 loss is preferred over L2 loss because the dataset contains non-negligible outliers in reported transition temperatures; L2 loss would excessively penalise such outliers and destabilise training, whereas L1 loss provides greater robustness. Optimisation was performed using the NAdam optimiser,20 over 300 epochs. This optimiser combines the benefits of adaptive learning rates and Nesterov momentum, leading to stable convergence in practice. Training required approximately 10 hours. Additional implementation details and hyperparameters are provided in the online repository.
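To make the combined objective concrete, the following is a minimal per-sample sketch in plain Python. It is not the repository implementation. The values `gamma=2.0` and `alpha=0.25` are commonly used focal-loss defaults rather than values reported here, and the relative `weight` and the choice to apply the regression term only when the phase is present are assumptions.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    # (1 - p_t)^gamma down-weights examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def l1_loss(pred, target):
    """Absolute error; grows linearly with the size of an outlier."""
    return abs(pred - target)

def l2_loss(pred, target):
    """Squared error; grows quadratically with the size of an outlier."""
    return (pred - target) ** 2

def combined_loss(p, y, t_pred, t_true, weight=1.0):
    """Focal loss plus (assumed) weighted L1 loss on the transition temperature."""
    loss = focal_loss(p, y)
    if y == 1:  # assumption: temperatures are regressed only where the phase exists
        loss += weight * l1_loss(t_pred, t_true)
    return loss

# An entry whose reported temperature is off by 330.3 degrees:
outlier_l1 = l1_loss(58.5, 388.8)
outlier_l2 = l2_loss(58.5, 388.8)
```

For the 330.3 °C discrepancy discussed later in the text, the L1 term contributes about 330 while the L2 term would contribute about 1.1 × 10⁵, which illustrates why a single erroneous entry can dominate an L2-trained model.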
The prediction task was formulated to jointly estimate (i) the existence of each phase and (ii) the corresponding transition temperatures, enabling a consistent description of phase sequences and their thermal behaviour.
Certain phases—namely smectic E (SmE), smectic J (SmJ), and chiral SmJ (SmJ*)—were excluded from the analysis. In the database, the notation for these phases is inconsistent with that used in the original literature, leading to ambiguity in their interpretation. To avoid introducing label noise that could adversely affect model training and evaluation, these phases were not considered as prediction targets.
Within the definition of the prediction targets, chiral and achiral variants were not treated as separate classes. Specifically, the pairs N/N*, SmC/SmC*, SmF/SmF*, and SmI/SmI* were grouped together when defining the phase labels. This choice is motivated by the fact that, within Landau–de Gennes theory of the N–Iso phase transition, the same primary order-parameter framework applies to both N and N* phases; chirality enters mainly through elastic couplings rather than by defining a fundamentally distinct mesophase class at this level of description. A similar argument applies to SmC, SmF, and SmI phases, whose chiral counterparts are often obtained by doping achiral hosts with small amounts of chiral additives. While chirality introduces additional features, such as helix-related signatures in physical properties, phase stability and transition temperatures are typically governed by those of the corresponding achiral host.
To assess whether this treatment affects the results, we performed an additional robustness analysis in which each pair (N/N*, SmC/SmC*, SmF/SmF*, and SmI/SmI*) was reformulated as a three-class problem: achiral, chiral, or absent. The resulting metrics are summarised in Table S1. This analysis confirms that the conclusions remain unchanged: the performance reported in Table 1 is not primarily driven by the merging of chiral and achiral labels. Instead, the dominant limitation arises from class imbalance, particularly for the less common smectic families.
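The two labelling schemes can be illustrated as follows. This is a sketch under assumptions: the function names are ours, a molecule is represented simply as a set of phase names, and a molecule listing both variants is assigned to the chiral class, a tie-break the text does not specify.

```python
# Achiral phase name mapped to its chiral counterpart.
CHIRAL_PAIRS = {"N": "N*", "SmC": "SmC*", "SmF": "SmF*", "SmI": "SmI*"}

def merged_label(phases, achiral):
    """Binary target used in the main analysis: 1 if either variant is present."""
    chiral = CHIRAL_PAIRS[achiral]
    return int(achiral in phases or chiral in phases)

def three_class_label(phases, achiral):
    """Target for the robustness analysis: 'achiral', 'chiral', or 'absent'."""
    chiral = CHIRAL_PAIRS[achiral]
    if chiral in phases:
        return "chiral"   # assumed tie-break when both variants are listed
    if achiral in phases:
        return "achiral"
    return "absent"

example = {"SmA", "N*"}
labels = (merged_label(example, "N"), three_class_label(example, "N"))
```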
LiqCryst 5.2 contains 107 773 substances. Such a large-scale dataset inevitably includes errors, necessitating careful curation for reliable analysis. At the same time, excessive manual intervention can introduce bias, potentially improving performance on the given dataset while reducing generalisability. In this study, our objective is not only to achieve high predictive performance but also to construct a model that captures general characteristics of LC materials from a heterogeneous dataset. This motivates a rule-based and reproducible preprocessing procedure.
Each substance in LiqCryst 5.2 is associated with multiple entries describing its structure and properties. We focused on entries containing molecular structures and phase transition sequences. Molecular structures are provided in several formats (e.g., images, MOL files, line notations, and SMILES). We used SMILES representations because they are well-suited for computational processing. Molecules lacking valid SMILES or containing duplicates were removed; most duplications arose from non-isomeric representations of chiral molecules.
We further defined a reproducible filtering protocol based on phase-sequence criteria and the presence of inorganic or otherwise out-of-scope compounds. Specifically, we imposed a minimum combined count of 12 carbon and nitrogen atoms, approximating the size of two six-membered rings. In addition, all entries containing metallic elements were excluded. These exclusions are intended as a rule-based domain definition and data-quality control, rather than implying that all excluded compounds are intrinsically incapable of exhibiting LC behaviour.
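A rough sketch of the composition filter is given below. It uses a simplified regular expression instead of a full SMILES parser, so it should be read as an approximation; a production pipeline would more likely rely on a cheminformatics toolkit. The `METALS` set is an illustrative, incomplete list, and all names are assumptions introduced here.

```python
import re

# Matches a bracket atom such as [C@@H], [nH], [Cu+2], or an organic-subset
# atom written outside brackets (aromatic atoms are lower case).
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"
)

# Illustrative and deliberately incomplete set of metallic elements (assumption).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Ti", "V", "Cr", "Mn", "Fe",
          "Co", "Ni", "Cu", "Zn", "Pd", "Ag", "Pt", "Au", "Hg", "Sn", "Pb"}

def element_symbols(smiles):
    """Return the element symbol of every atom token found in a SMILES string."""
    symbols = []
    for match in ATOM_PATTERN.finditer(smiles):
        raw = match.group(1) or match.group(2)
        symbols.append(raw[0].upper() + raw[1:])  # aromatic 'c' -> 'C', 'se' -> 'Se'
    return symbols

def in_scope(smiles, min_carbon_nitrogen=12):
    """Apply the two rules: no metallic elements, and at least 12 C plus N atoms."""
    symbols = element_symbols(smiles)
    if any(symbol in METALS for symbol in symbols):
        return False
    return sum(symbol in ("C", "N") for symbol in symbols) >= min_carbon_nitrogen

keep_5cb = in_scope("CCCCCc1ccc(cc1)-c1ccc(cc1)C#N")                        # 18 C + 1 N
keep_benzene = in_scope("c1ccccc1")                                          # only 6 C
keep_cu_soap = in_scope("CCCCCCCCCCCC(=O)O[Cu]OC(=O)CCCCCCCCCCC")            # contains Cu
```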
Phase transition sequences are recorded in the “Phases” field, including both lower and upper transition temperatures (e.g., “30 N 48”). We retained only entries with both limits available for the target phases, ensuring consistent comparison with model predictions. Entries with transition temperatures of 0 °C or above 800 °C were excluded as unreliable. We also removed entries containing “0 X 0” or “0 0”, which indicate monotropic behaviour.
We included clearing points to isotropic phases (“is”), extrapolated values (“ex”), decomposition temperatures (“dec”), and modified clearing points (“chg”), while excluding ambiguous annotations (“un”, “no”). For melting points, only values associated with crystalline phases (“Cr”, “Cr1”) were used; values associated with polymorphic or glass transitions (“Cr′”, “Cr2”, “Tg”) were excluded.
Special care was taken in handling ambiguous phase labels such as X (unidentified), SmX (uncategorised smectic), and Colx (uncategorised columnar). For example, the entry “Cr 20 X 65 N 72 is” indicates that during the heating process, the substance melts at 20 °C, enters an unknown (unidentified) phase X, transitions to the N phase at 65 °C, and finally transitions to the isotropic phase at 72 °C. For each target phase, entries were excluded if the phase was absent while ambiguous phases of the same class were present, to avoid label uncertainty. Additionally, closely related but distinct phases (e.g., discotic nematic, twist grain boundary phases, antiferroelectric or ferrielectric variants, and non-hexagonal columnar phases) were treated as separate and excluded when necessary to maintain consistency in phase definitions.
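The handling of the “Phases” field described above can be sketched as a small parser. This is a simplified illustration, not the actual preprocessing code: it assumes whitespace-separated tokens, writes the unidentified phase as X, ignores annotations such as “ex” or “dec”, and treats X as ambiguous for every target phase, which is our assumption.

```python
def is_number(token):
    """True if the token can be read as a temperature."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def parse_phases(field):
    """Map each phase flanked by two temperatures to its (lower, upper) limits."""
    tokens = field.split()
    ranges = {}
    for i in range(1, len(tokens) - 1):
        if not is_number(tokens[i]) and is_number(tokens[i - 1]) and is_number(tokens[i + 1]):
            ranges[tokens[i]] = (float(tokens[i - 1]), float(tokens[i + 1]))
    return ranges

def ambiguous_labels(target):
    """Ambiguous labels that could hide the target phase (assumed mapping)."""
    labels = {"X"}
    if target.startswith("Sm"):
        labels.add("SmX")
    if target.startswith("Col"):
        labels.add("Colx")
    return labels

def label_for_target(field, target):
    """Return 'present', 'absent', or 'excluded' for one target phase."""
    ranges = parse_phases(field)
    if target in ranges:
        lower, upper = ranges[target]
        if lower == 0 or upper == 0 or lower > 800 or upper > 800:
            return "excluded"  # implausible temperature
        return "present"
    if ambiguous_labels(target) & set(field.split()):
        return "excluded"      # target absent, but an ambiguous phase might be it
    return "absent"

example = parse_phases("Cr 20 X 65 N 72 is")
```

For the example entry, the parser yields X between 20 and 65 °C and N between 65 and 72 °C. The entry counts as present for N but is excluded when the target is SmA, because the unidentified phase X could in principle be SmA.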
After applying these curation steps, we obtained a dataset of 43 889 molecules with valid SMILES, phase assignments, and transition temperatures. The numbers of molecules exhibiting each phase (NV) were as follows: N and N* (20 662), SmA (9531), SmB (1710), SmC and SmC* (6444), SmF and SmF* (307), SmG (328), SmH (52), SmI and SmI* (274), and Colh (341).
We emphasise that organometallic compounds were excluded throughout the analysis for all phases; this clarification is particularly relevant for Colh, where such species are more commonly reported.
We first examined compounds exhibiting large discrepancies in transition temperatures. Using the mean and standard deviation (σ) of prediction errors, we identify outliers for the upper transition temperature of nematic phases (TN+). Among compounds reported to exhibit N or N* phases, 13 entries show deviations exceeding the mean error (7.39 °C) by more than 10σ (σ = 11.4 °C), indicating statistically extreme inconsistencies. Examination of the original literature reveals that at least 4 of these entries contain clear errors, while 6 are consistent with the reported data and 3 could not be verified due to unavailable references. Notably, a substantial fraction of these extreme outliers corresponds to genuine errors.
A representative example is shown in Fig. 3a, where the predicted TN+ is 58.5 °C, whereas the database reports 388.8 °C, resulting in a discrepancy of −330.3 °C.22 Interestingly, an alternative reference for the same compound in LiqCryst 5.2 reports TN+ = 50 °C,23 which is consistent with the prediction. Closer inspection reveals that the compound described in the former reference differs from that recorded in the database (Fig. 3b), indicating a transcription error. This example highlights how inconsistencies can arise from incorrect mapping between chemical structures and literature sources, and how such errors can be effectively detected through model-based screening.
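The outlier criterion can be written out explicitly. The sketch below reads it as "absolute prediction error greater than the mean error plus 10σ", which is one plausible interpretation of the wording above; the function names are ours.

```python
MEAN_ABS_ERROR = 7.39  # mean error for TN+ in degrees Celsius
SIGMA = 11.4           # standard deviation of the TN+ error in degrees Celsius
N_SIGMA = 10

def outlier_threshold(mean=MEAN_ABS_ERROR, sigma=SIGMA, k=N_SIGMA):
    """Absolute-error threshold above which an entry is flagged."""
    return mean + k * sigma

def is_extreme_outlier(predicted, reported):
    """Flag an entry whose absolute prediction error exceeds the threshold."""
    return abs(predicted - reported) > outlier_threshold()

threshold = outlier_threshold()
flag_first_reference = is_extreme_outlier(predicted=58.5, reported=388.8)
flag_second_reference = is_extreme_outlier(predicted=58.5, reported=50.0)
```

Under this reading the threshold is about 121.4 °C. The entry reporting 388.8 °C (error of 330.3 °C) is flagged, whereas the alternative reference reporting 50 °C (error of 8.5 °C) is not.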
We next examined discrepancies in phase existence. Among 18 185 compounds reported to exhibit N phases, 92 are predicted with high confidence (>0.99) not to exhibit such phases. Detailed inspection of the original papers reveals multiple types of errors. In one case (Fig. 3c), a compound reported as nematic in the database is described in the original paper as exhibiting a smectic A (SmA) phase with identical transition temperatures, indicating a transcription error.24 In another example (Fig. 3d and e), the molecular structure recorded in the database differs from that in the original publication: a nitrogen atom is replaced by a methine group.25 This subtle structural discrepancy is sufficient to alter phase behaviour, demonstrating that the ML model captures chemically meaningful distinctions at a level consistent with expert intuition. Similar structural inconsistencies were identified for multiple compounds within the same source, suggesting systematic transcription errors.
These observations indicate that high-confidence discrepancies identified by the model are strongly enriched in genuinely erroneous entries, making them effective candidates for targeted data validation.
We also investigate cases with lower prediction confidence, which provide complementary insights. Among 2477 compounds reported to exhibit N* phases, 162 are predicted not to exhibit N* phases with relatively low confidence (<0.80). One such example (confidence 0.73) reveals a more complex origin of inconsistency. While the database and the cited paper report identical phase behaviour,26 further examination of the original conference abstract shows that the compound does not exhibit an N* phase (Fig. 3f).27 Instead, the reported phase behaviour corresponds to an analogue with a shorter alkyl chain (Fig. 3g), whose data appear to have been incorrectly propagated into the database. This case illustrates that some anomalies arise not from simple transcription errors, but from more intricate chains of misattribution across multiple sources.
Importantly, such multi-step inconsistencies are difficult to identify without combining ML-based anomaly detection with detailed human investigation. This example underscores the complementary roles of statistical detection and domain expertise in understanding the provenance of anomalous data.
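The two candidate sets used above (confidence above 0.99 and below 0.80) suggest a simple triage rule for routing entries to expert review. The sketch below assumes that "confidence" means the probability the model assigns to its predicted class and that 0.5 is the decision boundary; both are assumptions, as are the function and category names.

```python
def triage(reported_present, p_present, high=0.99, low=0.80):
    """Classify a database entry by comparing its label with the model output.

    reported_present: True if the database lists the phase.
    p_present: model probability that the phase exists (assumed definition).
    """
    predicted_present = p_present >= 0.5
    confidence = max(p_present, 1.0 - p_present)  # probability of the predicted class
    if predicted_present == reported_present:
        return "consistent"
    if confidence > high:
        return "high-confidence discrepancy"
    if confidence < low:
        return "low-confidence discrepancy"
    return "intermediate discrepancy"

# An entry reported as N* but predicted absent with confidence 0.73.
low_case = triage(reported_present=True, p_present=0.27)
# An entry reported as N but predicted absent with confidence above 0.99.
high_case = triage(reported_present=True, p_present=0.005)
```

Both tiers still require checking the original sources: the score only prioritises which entries an expert looks at first.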
Importantly, not all anomalies necessarily correspond to errors. Some discrepancies may instead reflect unknown or poorly understood phenomena, although identifying such cases lies beyond the scope of the present study. This highlights the dual role of anomaly detection in both data validation and the potential for discovery. Distinguishing between these possibilities, however, remains a challenging task that cannot be resolved solely by predictive models.
The present analysis was conducted on 43 889 of 107 773 LiqCryst entries (40.7%) that satisfied our curation criteria. Within this curated subset, 13 of 20 662 N/N* entries (0.063%) were identified as extreme TN+ outliers, of which 4 were confirmed as errors, 6 were consistent with the cited data, and 3 could not be verified. For phase-label discrepancies, 92 of 18 185 N-labelled records and 162 of 2477 N*-labelled records constituted the high-confidence and lower-confidence candidate sets, respectively. These figures describe screened candidate subsets rather than the overall prevalence of database errors.
Our study illustrates that combining ML-based meta-analysis with human expertise provides a practical framework for interrogating large, heterogeneous scientific datasets. Future work will focus on incorporating explainable modelling approaches to better characterise the origins of anomalies and to further bridge the gap between data-driven detection and scientific interpretation.15
Supplementary information includes additional robustness results (Table S1) for the three-class formulation of chiral/achiral/absent phase labels and supporting details relevant to model evaluation. See DOI: https://doi.org/10.1039/d6sm00257a.
† Equal contribution.
This journal is © The Royal Society of Chemistry 2026