Themed collection: Data-driven discovery in the chemical sciences

24 items
Open Access Paper

Making the InChI FAIR and sustainable while moving to inorganics

The InChI standard facilitates chemical compound identification across platforms, with v1.07 fixing numerous issues and enhancing transparency via GitHub. This update aims to better represent molecular inorganic compounds, addressing previous limitations.

Open Access Paper

A critical reflection on attempts to machine-learn materials synthesis insights from text-mined literature recipes

Machine-learned regression or classification models built from historical materials synthesis datasets have limited utility in guiding the predictive synthesis of novel materials, but anomalous recipes can inspire surprising new synthesis strategies.

Open Access Paper

Web-BO: towards increased accessibility of Bayesian optimisation (BO) for chemistry

Improving accessibility of data-driven optimisation for chemical tasks via a graphical user interface.

Open Access Paper

Large property models: a new generative machine-learning formulation for molecules

We have built the first transformers trained on the property-to-molecular-graph task, which we dub “large property models”. A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data.

Open Access Paper

Data-efficient fine-tuning of foundational models for first-principles quality sublimation enthalpies

We present an accurate and data-efficient protocol for fine-tuning the MACE-MP-0 foundational model for a given system. Our model achieves sub-kJ/mol accuracy in predicting sublimation enthalpies and below 1% error in the density of ice polymorphs.

Paper

Analysis of uncertainty of neural fingerprint-based models

Assessment of uncertainty estimates of neural fingerprint-based models by comparing deep learning-based models with combinations of neural fingerprints and classical machine learning algorithms that employ established uncertainty calibration methods.

Open Access Paper

Prediction rigidities for data-driven chemistry

We demonstrate the wide utility of prediction rigidities, a family of metrics derived from the loss function, in understanding the robustness of machine learning (ML) model predictions.

Open Access Paper

Sequence determinants of protein phase separation and recognition by protein phase-separated condensates through molecular dynamics and active learning

We investigate three related questions: can we identify the sequence determinants that lead to protein self-interaction and phase separation; can we understand and design new sequences that selectively bind to protein condensates; and can we design multiphasic condensates?

Open Access Paper

How big is big data?

The advent of larger datasets in materials science poses unique challenges in modeling, infrastructure, and data diversity and quality.

Paper

Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction

We evaluate FlanT5 and ByT5 across tokenisation, pretraining, finetuning and inference and benchmark their impact on organic reaction prediction tasks.

Open Access Paper

Modelling ligand exchange in metal complexes with machine learning potentials

We introduce a strategy to train machine learning potentials using MACE, an equivariant message-passing neural network, for metal–ligand complexes in explicit solvents.

Open Access Paper

Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug-, materials-, and molecular-discovery

We derive maximum and realistic performance bounds based on experimental errors for commonly used machine learning (ML) datasets for regression and classification and compare them to the reported performance of ML models.

Open Access Paper

Discovery of highly anisotropic dielectric crystals with equivariant graph neural networks

We adopt the latest approaches in equivariant graph neural networks to develop a model that can predict the full dielectric tensor of crystals, discovering crystals with almost isotropic connectivity but highly anisotropic dielectric tensors.

Paper

Accurate and reliable thermochemistry by data analysis of complex thermochemical networks using Active Thermochemical Tables: the case of glycine thermochemistry

Active Thermochemical Tables (ATcT) are employed to resolve existing inconsistencies surrounding the thermochemistry of glycine and produce accurate enthalpies of formation for this system.

Open Access Paper

Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes

Leveraging natural language processing models including transformers, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism.

Open Access Paper

Predictive crystallography at scale: mapping, validating, and learning from 1000 crystal energy landscapes

We demonstrate the reliability and scalability of computational crystal structure prediction (CSP) methods for small, rigid organic molecules by performing in-depth CSP investigations for over 1000 such compounds.

Open Access Paper

Integration of generative machine learning with the heuristic crystal structure prediction code FUSE

We integrate generative machine learning with heuristic crystal structure prediction in FUSE. The combined result shows superior performance over both components, accelerating the pace at which we will be able to predict and discover new compounds.

Open Access Paper

Beyond theory-driven discovery: introducing hot random search and datum-derived structures

Long high-temperature anneals driven by Ephemeral Data-Derived Potentials (EDDPs) and combined with AIRSS, termed hot-AIRSS, enable the exploration of low-energy configurations of complex materials.

Paper

Optical materials discovery and design with federated databases and machine learning

New hypothetical compounds are reported in a collection of online databases. By combining active learning with density-functional theory calculations, this work screens through such databases for materials with optical applications.

Open Access Paper

Knowledge distillation of neural network potential for molecular crystals

Knowledge distillation improves the neural network potential for organic molecular crystals.

Open Access Paper

Mapping inorganic crystal chemical space

We enumerate binary, ternary, and quaternary element and species combinations and present a two-dimensional representation of inorganic crystal chemical space, labelled according to whether the combinations pass standard chemical filters and whether they appear in known databases.

Open Access Accepted Manuscript - Paper

How to do impactful research in artificial intelligence for chemistry and materials science

Accepted Manuscript - Paper

Re-evaluating Retrosynthesis Algorithms with Syntheseus

Open Access Accepted Manuscript - Paper

Embedding human knowledge in material screening pipeline as filters to identify novel synthesizable inorganic materials

About this collection

We are delighted to share with you a selection of the papers associated with a Faraday Discussion on Data-driven discovery in the chemical sciences. More information about the related event may be found here: http://rsc.li/data-fd2024. Additional articles will be added to the collection as they are published. The final versions of all the articles presented and a record of the discussions will be published after the event.

The Discussion will involve four central themes – each focused on different aspects of chemical "discovery", and each aiming to promote the exchange of ideas between the molecular and materials communities: Discovering chemical structure, Discovering structure–property correlations, Discovering synthesis targets, Discovering trends in big data.

On behalf of the Scientific Committee, we hope you join us and participate in this exciting event, and that you enjoy these articles and the record of the discussion.
