Transfer learning for a foundational chemistry model

Data-driven chemistry has garnered much interest concurrent with improvements in hardware and the development of new machine learning models. However, obtaining sufficiently large, accurate datasets of a desired chemical outcome remains a challenge. The community has made significant efforts to democratize and curate available information for more facile machine learning applications, but the limiting factor is usually the laborious nature of generating large-scale data. Transfer learning has been noted in certain applications to alleviate some of the data burden, but this protocol is typically carried out on a case-by-case basis, with the transfer learning task expertly chosen to fit the finetuning. Herein, I develop a machine learning framework capable of accurate chemistry-relevant prediction amid general sources of low data. First, a chemical “foundational model” is trained using a dataset of ∼1 million experimental organic crystal structures. A task-specific module is then stacked atop this foundational model and subjected to finetuning. This approach achieves state-of-the-art performance on a diverse set of tasks: toxicity prediction, yield prediction, and odor prediction.


Code Availability:
The base model latent space parameters and finetuned model parameters are available on the GitHub repository (https://github.com/emmaking-smith/Modular_Latent_Space/tree/master). Cleaned, open-source datasets and all code have also been made available. Access to the Cambridge Crystallographic Data Centre's (CCDC) Cambridge Structural Database (CSD) Python API and data can be obtained by entering a private access agreement with CCDC.

A Non-Expert's Guide to Transfer Learning (CliffsNotes Version):
Transfer learning is the process whereby the information gathered from one source (the pretraining dataset) is used to "jump start" the learning from a second source (the finetuning dataset). Typically, the finetuning dataset is the desired prediction target. If the finetuning dataset is small and data augmentation through other sources or experimental design is not possible or is cost-prohibitive, transfer learning is a potential solution. It allows the user to use a bigger model with the bigger pretraining dataset, which may lend itself to better results on the finetuning dataset task (finetuning task). Transfer learning may also be thought of as a way of pointing the model in the correct direction.
Practically, transfer learning proceeds as follows. A machine learning model is trained on the pretraining dataset to predict a pretraining task. The final layer, used to formulate the pretraining task predictions, is removed and replaced with a new layer that will be used to predict the finetuning dataset task. The previous neural network layers retain the information learned from the pretraining task. From here, the user may opt to "freeze" the pretrained layers or to re-train the whole system. Freezing the layers allows small datasets to be used with deep neural networks. Re-training the whole system may be more beneficial if both datasets are sufficiently large and sufficiently distinct from one another (Figure S1).
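The freeze-and-finetune recipe just described can be illustrated with a toy sketch. The NumPy snippet below is purely illustrative (it is not the repository's code; the layer sizes, learning rate, and random data are arbitrary assumptions): a "pretrained" hidden layer is held fixed while only a newly attached output layer is trained on a small finetuning dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were learned on the large pretraining dataset.
W_pre = rng.normal(size=(4, 8))        # frozen "foundational" layer

# Tiny finetuning dataset: 6 samples, 4 features, scalar targets.
X = rng.normal(size=(6, 4))
y = rng.normal(size=(6, 1))

# New task-specific head, randomly initialised and trained from scratch.
W_head = rng.normal(size=(8, 1)) * 0.1

def forward(X):
    h = np.tanh(X @ W_pre)             # frozen representation
    return h, h @ W_head               # new head's prediction

_, pred0 = forward(X)
mse_start = float(np.mean((pred0 - y) ** 2))

for _ in range(500):
    h, pred = forward(X)
    grad = h.T @ (pred - y) / len(X)   # MSE gradient w.r.t. the head only
    W_head -= 0.1 * grad               # W_pre is never updated ("frozen")

_, pred = forward(X)
mse_end = float(np.mean((pred - y) ** 2))
print(f"MSE before finetuning: {mse_start:.4f}, after: {mse_end:.4f}")
```

Re-training the whole system would simply mean also computing a gradient for W_pre and updating it in the same loop.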

Model         Test Set MSE
Small MPNN    3.17
Large MPNN    2.93

Table S1: Mean Squared Error (MSE) of the total loss (bond distance loss + bond angle loss) on crystal structure data for a variety of message passing neural networks (MPNNs). The test set consisted of unseen molecules.

Compound | True Toxicity (log(mol kg⁻¹)) | Crystal-Tox Predicted Toxicity (log(mol kg⁻¹))
(Column headings for the per-compound toxicity predictions in Table S2.)

A horizontal line indicates that none of the top 5 most likely labels were correct. Blue predictions show that Crystal-Olfaction correctly identified that the enantiomeric pair had an identical / differing olfactive profile, even if no label was correctly predicted in the top 5. Red predictions indicate that Crystal-Olfaction incorrectly identified the similarity of the scent profile between the enantiomeric pair.

Move the unzipped directory, Modular_Latent_Space-master, from its current location to the directory we created at the beginning of this section (we named it transfer_learning). Drag and drop is the easiest way to do this.
Shortly thereafter, the console will prompt you to accept the installation of new packages. Type: y

Package Installation___________________________________________________________
Next, all the relevant packages of specific versions will be installed. We specify this with the "==" sign. RDKit will be installed first.

Next, install numpy. This may have already been installed with a previous package.

pip install numpy==1.21.5

Then install pandas. Similar to numpy, this may already have been accomplished with a previous package installation.

pip install scikit-learn==1.0.2

We are now all set up!

Run the Transfer Learning______________________________________________________
We will be using the Buchwald-Hartwig dataset as our example transfer learning task. Note that the pretrained layers from the crystal structure information will be frozen. This has been done for you and can be observed in Modular_Latent_Space/buchwald/buchwald_yield_mpnn, lines 29-30 (see orange box in the figure below).

test_mol_idx — an integer between 0 and 3 if your split is NOT base, or an integer between 0 and 2 if your split IS base. NOTE: Case sensitive!
--save_path transfer_learning_test — creates a new directory in the Modular_Latent_Space directory called transfer_learning_test. NOTE: Assuming your current location is the Modular_Latent_Space directory.
For more flags, please refer to Modular_Latent_Space/buchwald/buchwald_yield_prediction.py, lines 25-37. For a basic transfer learning run, feel free to use the default options.

Run the Transfer Learning (Finally!):
We will then run the transfer learning on the Buchwald-Hartwig dataset. The module to do so is called buchwald_yield_prediction.py and is located in the buchwald directory. Move yourself into the buchwald directory using the cd command. If you are currently in the Modular_Latent_Space directory, this can be achieved with:

cd buchwald

You can easily tell you are now in the buchwald directory by looking at the path in blue. The final name will be "buchwald" (see orange box in the figure below).
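From inside the buchwald directory, a full invocation might look like the sketch below. Only --save_path is documented explicitly in this guide; the --test_mol_idx flag name and the value 0 are assumptions based on the flag descriptions above, so confirm the exact names against lines 25-37 of buchwald_yield_prediction.py.

```shell
# Hypothetical example run (flag names other than --save_path are assumed;
# check the script's argument parser for the authoritative list).
python buchwald_yield_prediction.py \
    --test_mol_idx 0 \
    --save_path transfer_learning_test
```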

Figure S3: Distribution of LD50 values and molecular sizes for the toxicity finetuning tasks. Dark blue bars indicate the training set distribution, light blue bars indicate the TDC testing set distribution, and teal bars indicate the non-pharmaceutical testing set distribution.

Figure S12: The difference between the fragrance training and enantiomeric pairs testing sets, in both molecular size and in the most common odor classification classes. Dark blue bars indicate training set distribution and light blue bars indicate testing set distribution.

Unzip the zip file. This can typically be achieved by double clicking on the file.
conda install -y -c rdkit rdkit==2020.09.1

The most critical package to get the correct version of is networkx: after version 1.11, the graph nomenclature changed, and attempting to run the code with later versions will result in an error. If you believe you are seeing a networkx error, please double check that the version you are running is 1.11. This can easily be done using the following commands. Start python (make sure you are in your virtual environment) and type:

import networkx
networkx.__version__

You should see an output of 1.11.

Table S2: Predicted and true toxicity values of each compound in the non-drug test set for the best Crystal-Tox and Oloren ChemEngine models.

Table S3: The mean absolute error (MAE) for each fold in the Buchwald-Hartwig yield prediction. For halides and additives, several were left out at a time to allow for equal training-testing splits for all validations. Bolded entries indicate the best model for each fold. (a) Crystal-Yield with the output block increased from ~260K parameters to ~1 million parameters; GraphRXN had ~2 million parameters.

Figure S1: Graphical representation of the process of transfer learning. The neural network may be frozen (no more training occurs) if the finetuning dataset size cannot accommodate the depth of the whole system.

Figure S2: Ranking of elements in our cleaned CCDC dataset.