Issue 6, 2022

A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing

Abstract

Knowledge of the phase equilibria of mixtures is crucial in nature and in technical chemistry. Phase equilibrium calculations for mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. Machine learning approaches have recently been developed to predict activity coefficients accurately and efficiently. However, current machine learning approaches still extrapolate poorly to activity coefficients of unknown molecules. In this work, we introduce a SMILES-to-properties-transformer (SPT), a natural language processing network, to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables the SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS and UNIFAC (Dortmund), and improving on recent machine learning approaches.
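As a rough illustration of how a solute/solvent pair of SMILES codes might be turned into a token sequence for a language model, consider the minimal sketch below. It is not the authors' implementation: the regular expression, the function names `tokenize_smiles` and `encode_pair`, and the special tokens `<SOS>`, `<SEP>` and `<EOS>` are all assumptions made for this example, since the abstract does not specify the tokenization.

```python
import re
from typing import List

# Hypothetical token pattern (an assumption, not taken from the paper):
# bracket atoms such as [Na+], the two-letter halogens Br and Cl,
# two-digit ring closures such as %10, then any other single character.
_TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split one SMILES string into a list of tokens."""
    return _TOKEN_PATTERN.findall(smiles)


def encode_pair(solute: str, solvent: str) -> List[str]:
    """Join solute and solvent tokens into a single input sequence.

    The special tokens are illustrative placeholders only.
    """
    return (
        ["<SOS>"]
        + tokenize_smiles(solute)
        + ["<SEP>"]
        + tokenize_smiles(solvent)
        + ["<EOS>"]
    )


# Ethanol (solute) at infinite dilution in water (solvent).
print(encode_pair("CCO", "O"))
# prints ['<SOS>', 'C', 'C', 'O', '<SEP>', 'O', '<EOS>']
```

In a setup like the one described in the abstract, such token sequences would first be paired with synthetic COSMO-RS values for pretraining and afterwards with experimental values for fine-tuning.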

Article information

Article type: Paper
Submitted: 10 Jun 2022
Accepted: 27 Sep 2022
First published: 29 Sep 2022
This article is Open Access
Creative Commons BY-NC license

Digital Discovery, 2022, 1, 859-869

B. Winter, C. Winter, J. Schilling and A. Bardow, Digital Discovery, 2022, 1, 859 DOI: 10.1039/D2DD00058J

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.
