From the journal Digital Discovery Peer review history

Data-driven representative models to accelerate scaled-up atomistic simulations of bitumen and biobased complex fluids

Round 1

Manuscript submitted on 11 Dec 2023
 

19-Mar-2024

Dear Dr Martin-Martinez:

Manuscript ID: DD-ART-12-2023-000245
TITLE: Data-driven representative models to accelerate scaled-up atomistic simulations of bitumen and biobased complex fluids

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Professor Jason Hein
Associate Editor, Digital Discovery

************


 
Reviewer 1

Reviewer report


Code Review

Details regarding the "Data Reviewer Checklist" are shown as follows.

1a. Are all data sources listed and publicly available?
No. I cannot find anywhere the database, i.e. the <csv_file> to reproduce the findings reported in the paper. The authors should upload all raw data to github and zenodo.

1b. If using an external database, is an access date or version number
provided?
No (see answer to question 1a).

4a. Is a software implementation of the model provided such that it can be trained and tested with new data?

Not in its current version. The code first needs to be reproducible with the test cases, which will then allow it to be robust enough to be tested with new data.

6b. Are scripts to reproduce the findings in the paper provided?

No. Due to the answer given in question 1a, I cannot reproduce the reported findings.
Please update the README according to the suggestions I've provided below.

6c. Have the authors clearly specified which versions of the software libraries they depend upon were used in the course of the work?

After a fresh git clone, there are some errors with the installation of "requirements.txt", i.e.

$ pip install -r requirements.txt
ERROR: Could not find a version that satisfies the requirement sys (from versions: none)
ERROR: No matching distribution found for sys

This occurs because the current version of 'requirements.txt' lists modules that belong to the Python standard library ("sys", "os", and "math") rather than installable packages, so pip cannot find any distribution for them; these are precisely the lines specified in 'requirements.txt'.

1. Firstly, in order to avoid creating issues with the local system of the user, I strongly recommend the inclusion of instructions for creating a virtual environment. Using a virtual environment prevents potential conflicts with the local system by isolating package installations. I suggest adding the following to the README.md:

2. For the best user experience, I suggest removing lines "sys", "os" and "math" from "requirements.txt", just leaving line "pandas", and writing down in the README.md the following instructions.
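For example, after this change the entire contents of `requirements.txt` would be the single line:

```
pandas
```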

- from now on, everything contained between the opening and closing tags <to be written in README.md> is what I suggest should be written in the README.md

<to be written in README.md>
# Header: Installation
## subheader: Installation of virtual environment

We first create a virtual environment to avoid dealing with other packages installed in the system:

Test folder:
`$ mkdir Test && cd Test`
`$ git clone https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids.git`

We create a virtual environment:
`$ python -m venv biobased_venv`

To activate this environment, use
`$ source ./biobased_venv/bin/activate`

To deactivate the environment:
`$ deactivate`


## subheader: Installation of dependencies and packages:
Install requirements:

`(biobased_venv) $ python -m pip install -r requirements.txt`

This will install pandas.
</to be written in README.md>

Reviewer 2

Dear authors,
congratulations for your great work. I enjoyed reading this paper and I'm convinced that it contributes significantly to the modelling of complex mixtures. I recommend this paper for publication after minor revision. I have the following questions for consideration:

Might the database against which you compare the results of GCMS/LCMS/HPLC analysis, or the detection limit of the method/equipment, or the molecule classification method represent a limitation? If so, would it be helpful to give some recommendations for the experimental work?

Have you considered the possibility to perform experimental validation of the different models, additional to the "all molecules" model benchmark? Would you expect significant differences?

Do you think it would be possible to automate the model selection (PT/FT/AG/SG) depending on the type of mixture (more/less diversity, etc.) to advance in your aim of author-agnostic research?

One final minor comment: the figure in page 23 does not display well.

Best regards

Reviewer 3

The authors report a nice paper titled "Data-driven representative models to accelerate scaled-up atomistic simulations of bitumen and biobased complex fluids". Their approach reduces the number of atoms required for such simulations by a factor of 10 to 100, while still capturing the overall chemical features of the fluid. This is an important result, and the application to asphalt additives and biofuel precursors is very relevant. Overall, the work is novel and the automation mechanics are interesting. Some comments are included below, specifically around providing more details regarding the data-driven representative models, a discussion of related literature and methods, and an outlook into future opportunities.

Specific comments:

-A nice feature of the paper is the use of DFT data (ORCA based) for developing the framework. However, the paper should better explain the methods used, especially on the algorithmic side (what type of machine learning approach, discuss related/earlier papers in that space, etc.). The current draft does not provide sufficient details and discussion of related work in the field. If the authors can include that, it would allow readers to better understand the context of the work.

-What exactly is the approach used to build "molecular classification system that selects molecules after tagging"? Is it some sort of embedding model or simply feature identification (if the latter, this should be clarified as this is a term of art that will be more widely understood)?

-What is meant by "author-agnostic computational framework "?

-The GitHub link provided in the paper does not work and the reviewer was not able to review the details of the code/algorithm

-The authors should cite relevant literature more broadly, e.g. https://doi.org/10.1016/j.jmps.2023.105454 (use of multimodal transformers for chemical, mechanical and other properties), https://doi.org/10.1063/5.0155890 (DFT data used in transformer models for solvents etc.), and others.

-Can the authors explain a bit more about the speedup obtained over current/related methods of developing coarse-grained force fields 'manually'?

-There have been other works that use automated interactive workflows, e.g. using agent-based modeling, that should be reviewed (e.g.: https://doi.org/10.1016/j.eml.2024.102131 and https://arxiv.org/abs/2402.04268).

-Figure 3 is important and should perhaps be plotted over the entire page width (like Fig 5 is, for instance) to make the text more legible.

-Figure 9, radar plots - the caption is not on the same page as the plots (probably an issue that arises from PDF conversion)

-Same figure, some additional discussion would be helpful especially if the authors could delineate the tradeoffs better.

-Given that the features seem very specific to the chemistry considered, can the authors discuss how and why they expect it to generalize (especially vs. related methods that are used widely like transformer models to model complex chemistry)?


 

Please see the attached response letter: Manuscript_authors_response.pdf

This text has been copied from the PDF response to reviewers and does not include any figures, images or special characters:

Title: "Data-driven representative models to accelerate scaled-up atomistic simulations of bitumen and biobased complex fluids." 

Authors: Daniel York, Isaac Vidal-Daza, Cristina Segura, Jose Norambuena-Contreras, Francisco J. Martin-Martinez

Journal Reference:

Dear editor,

The authors would like to thank the reviewers for the time taken to review the manuscript and provide invaluable feedback to help improve and refine both the manuscript and software. We have included each reviewer’s comment below along with how we have addressed it; where we have edited the text, the new text is included in italics and the changes are highlighted in red.

REVIEWER #1 – Code review

Reviewer's comment:
Note: both comments relate to the dataset.

1a. Are all data sources listed and publicly available?
No. I cannot find anywhere the database, i.e. the <csv_file> to reproduce the findings reported in the paper. The authors should upload all raw data to github and zenodo.

1b. If using an external database, is an access date or version number
provided?
No (see answer to question 1a).

Response from the authors:

We thank the reviewer for pointing out the issues for testing the code. Following the reviewer’s comment, a new folder has been created in the github repository called “data”, which contains the data sources needed to test the code. There are three csv files containing the raw data required by the reviewer (each csv file corresponds to each fraction defined in the paper, i.e. complete bio-oil, phenolic fraction, and wax fraction). With these files and the updated version of the code that we provide it will be possible to reproduce the results shown in the manuscript – these files can be found at:
Main_repo → data → …_DFT_results.csv
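For illustration only, once the repository is cloned these csv files can be inspected with pandas; the snippet below locates one of the fraction files by its suffix rather than naming it, since the exact filenames are placeholders here:

```python
from pathlib import Path
import pandas as pd

# Hypothetical sketch: point this at one of the three *_DFT_results.csv
# files provided in the repository's "data" folder.
csv_file = next(Path("data").glob("*_DFT_results.csv"))
df = pd.read_csv(csv_file)
print(df.head())  # quick look at the GCMS/DFT columns
```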

The updated github repository can be checked here:

https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids

We have also updated the zenodo repository, which can be checked here:

https://zenodo.org/doi/10.5281/zenodo.10356309



Reviewer's comment:
4a. Is a software implementation of the model provided such that it can be trained and tested with new data?

Not in its current version. The code first needs to be reproducible with the test cases, which will then allow being robust enough to be tested with new data.

Response from the authors:

Following the reviewer’s comment, a new version of the code is now available with test cases (see response to reviewer’s comment 1a for where this data can be found) for reproducibility. We have also realised that some key functions were not available in the code. We have included these functions and made small updates to other functions to make the code easier to use. The improvements are explained below.

A new file called “base_functions.py” has been created in the “functions” folder of the repository, which also contains functions not available in the original version of this repository, as follows.

• vol_from_smiles(smiles)
• vol_from_mol(mol)
• has_heteroatoms(mol)
• has_rings(mol)
• get_group_area(list_of_orca_molecule_class_objects)
• calculate_heteroatom_percentages(model)
• get_mixture(model)
• calculate_heteroatom_percentages_sing_mol(mol)
• find_minimum(list_of_orca_molecule_class_objects, attribute)
• min_mols_4_simulation(model, model_type)

Note: As stated above, these functions are present in a new python file: Main_repo → functions → base_functions.py

The functions in base_functions.py were not included in the original repository but are crucial for the generation of our models. We thank again the reviewer for helping us to realise those mistakes.
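For orientation only, a minimal usage sketch is shown below; it assumes these helpers can be imported from functions/base_functions.py and that they accept SMILES strings or RDKit Mol objects as their names suggest, which may differ from the actual implementation:

```python
# Illustrative sketch only; signatures and behaviour are assumed from the
# function names listed above, not taken from the repository.
from rdkit import Chem
from functions.base_functions import vol_from_smiles, has_rings, has_heteroatoms

smiles = "Oc1ccccc1"             # phenol, a typical bio-oil component
mol = Chem.MolFromSmiles(smiles)

print(vol_from_smiles(smiles))   # approximate molecular volume
print(has_rings(mol))            # ring-based feature identification
print(has_heteroatoms(mol))      # heteroatom-based feature identification
```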

The other two python files in the “functions” folder have also been slightly updated as follows:

- “class_definitions.py” has an updated function to pull data from the csv files (“csv_to_orca_class(csv_file)”)
- “model_functions.py” has an updated version of the “get_weighted_average(molecules, class_attribute, model_type)” function which now utilises list comprehension.

Note: As stated above, these are edits to functions in the original version of the repository and can be found in: Main_repo → functions → class_definitions.py/model_functions.py
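For readers unfamiliar with the pattern, an abundance-weighted average built with a list comprehension might look like the sketch below; the attribute names are placeholders and not necessarily those used in model_functions.py:

```python
def weighted_average(molecules, class_attribute):
    # Placeholder sketch: assumes each molecule object exposes a fractional
    # `.abundance` and the requested descriptor attribute.
    weighted = [getattr(m, class_attribute) * m.abundance for m in molecules]
    total = sum([m.abundance for m in molecules])
    return sum(weighted) / total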

The main scripts that generate the models (i.e., “data_driven_model_generator.py” and “data_driven_model_generator_cmd_line.py”) have been updated to run as ‘main’ functions. Both files are found in the main folder of the repository.
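In practical terms, running the scripts as ‘main’ functions typically means guarding the entry point as in the generic sketch below (the function name main() is illustrative and may differ from the scripts’ actual structure):

```python
def main():
    # Entry point: read the csv data, build the representative models,
    # and write the output files.
    ...

if __name__ == "__main__":
    main()
```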

We have also provided the DFT data used for the generation of models within the paper as addressed in the reviewer’s comment 1a – these datafiles (csv format) include data from both GCMS and DFT and contain everything required to reproduce the data reported.
If new data (i.e. GCMS data) from other complex mixtures needs to be tested - and models generated for these mixtures - dictionaries and lists of heteroatoms and structures may need to be updated based on the composition of those mixtures, i.e., if they contain atoms that are not N, C, O or S; or they contain larger ring structures.

The dictionaries for updating heteroatom weights can be found as follows.
• Main_repo → functions → model_functions.py → get_heteroatom_content
• Main_repo → functions → base_functions.py → calculate_heteroatom_percentages

Note: These are specific functions within the given python files.

The lists containing structures and heteroatoms SMARTS pattern for feature identification can be found as follows.
• Main_repo → functions → base_functions.py → has_rings
• Main_repo → functions → base_functions.py → has_heteroatoms
Note: These are specific functions within the given python files.
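As a rough illustration of what extending these lists involves, the snippet below shows SMARTS-based feature identification with RDKit; the patterns are generic examples, not the repository’s actual lists:

```python
# Generic SMARTS-based feature identification; patterns are examples only.
from rdkit import Chem

heteroatom_patterns = {
    "oxygen": Chem.MolFromSmarts("[#8]"),
    "nitrogen": Chem.MolFromSmarts("[#7]"),
    "sulfur": Chem.MolFromSmarts("[#16]"),
}
ring_pattern = Chem.MolFromSmarts("[r]")  # any atom in a ring

mol = Chem.MolFromSmiles("c1ccc2occc2c1")  # benzofuran, a bio-oil subclass
print({name: mol.HasSubstructMatch(p) for name, p in heteroatom_patterns.items()})
print(mol.HasSubstructMatch(ring_pattern))
```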

We only provide DFT data for the bio-oil mixtures produced from pine bark as described in the manuscript and any additional tests with other complex mixtures would require a new set of DFT data for the molecules in those specific mixtures.

Reviewer’s comment:
Note: both comments relate to the structure of the github repository and issues setting up and running the code.

6b. Are scripts to reproduce the findings in the paper provided?

No. Due to the answer given in question 1a, I cannot reproduce the reported findings. Please update the README according to the suggestions I’ve provided below.

6c. Have the authors clearly specified which versions of the software libraries they depend upon were used in the course of the work?

After a fresh git clone, there are some errors with the installation of “requirements.txt” (much more detail was given in this comment regarding specifics of writing the readme.md file and ensuring the requirements.txt provides a good user experience)

Response from the authors:

The scripts and functions they use have been updated and this is described in our response to the reviewer’s comment 4a.

The readme.md file now includes instructions to set up a virtual environment as per the reviewer’s feedback; the new contents of the file read:

# Installation
## Installation of a virtual environment

We first create a virtual environment to avoid dealing with other packages installed in the system:

Test folder:
```shell
$ mkdir Test && cd Test
$ git clone https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids.git
```
We create a virtual environment:
```shell
$ python -m venv biobased_venv
```
To activate this environment, we use:
```shell
$ source ./biobased_venv/bin/activate
```
To deactivate the environment:
```shell
$ deactivate
```

Note: the readme file can be found at Main_repo → readme.md

We have also updated requirements.txt, removing python packages that are included by default and adding RDKit as an additional requirement. For the non-default python packages used (i.e., pandas and RDKit) we have included the versions within the file “requirements.txt” as follows:

• rdkit==2022.9.3
• pandas==2.2.1

Note: The requirements.txt file can be found here: Main_repo → requirements.txt

The updated github repository can be checked here:

https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids

REVIEWER #2

Reviewer's comment:
Might the database against which you compare the results of GCMS/LCMS/HPLC analysis, or the detection limit of the method/equipment, or the molecule classification method represent a limitation? If so, would it be helpful to give some recommendations for the experimental work?

Response from the authors:

This is a good point made by the reviewer. Indeed, the molecules identified in the initial GCMS data that is provided to our computational platform are limited by the database against which the equipment identifies the molecules, as well as by the detection limit of the method/equipment used for the analysis. Nevertheless, these are limitations that are beyond our control and do not impact the performance of our method. These are challenges for the analytical process used to provide us with valuable data. The generation of models carried out by our method is independent of the number of molecules identified during the characterisation, although the models will obviously be more representative of the real sample if the experimental data used as input is more complete. As more detailed characterisations become available, our model generation method remains the same but will work over better experimental data.

To make this excellent point in the manuscript, and clarify any similar questions by the reader, we have added two paragraphs in the “introduction” (which is connected to a point made by reviewer 3) and in the “representative model development” sections. The paragraph in the introduction also includes a final sentence with recommendations to the experimental work, as suggested by the reviewer. The new paragraphs read:

“The success of these applications, as well as the capacity of the models developed here to describe real complex systems, are constrained by the accuracy and detection limits of the GCMS/LCMS/HPLC analysis used to generate the raw data. Therefore, the most complete analysis possible that maximises the number of molecules identified would be ideal to provide the most extensive and accurate experimental characterisation of the complex fluid.”

“Both methods are tested with GCMS data, which is limited by the size of the database against which the compounds are identified, and the resolution of the experimental equipment, i.e., GCMS detection limits. However, our methods for model generation are independent of the number of molecules characterised by the analytical equipment, and the same algorithms apply to any experimental data provided, independently of the equipment’s resolution, as long as the proportion of each component within the complex fluid is quantifiable.”



Reviewer's comment:
Have you considered the possibility to perform experimental validation of the different models, additional to the "all molecules" model benchmark? Would you expect significant differences?

Response from the authors:

This is a very interesting point, and very challenging to address experimentally. Isolating from a complex bio-oil the molecules computationally selected by our methods is technically beyond the state of the art of any separation or purification technique, given the complexity of these bitumen-like materials. We have achieved a separation into phenolic and wax fractions, as already described. It would also be possible to perform a SARA analysis, but even with that we would just separate molecules into solubility classes. To perform an experimental validation as proposed by the reviewer, the only available route would be to buy and/or synthesise each individual molecule in the model and to mix them to create the mixture described by the model ad hoc. This would certainly make it possible to create tailor-made mixtures that match our models, but it is beyond our experimental capabilities. Nevertheless, it is something we would like to explore in the future. However, to further consider the reviewer’s suggestion, we intend to carry out molecular dynamics (MD) simulations of each of our models to simulate some collective properties of the experimental mixture and see if the MD results match the experimental data and the benchmark model that includes all components of the bio-oil. To address this comment, we have included a small paragraph in the “Representative model development” section that reads:

“Given the complexity of bio-oils, isolating individual molecules experimentally to create samples that mimic our computational models is technically unachievable. An alternative could be the synthesis of molecular mixtures matching the composition of our models, although this generation of experimental mixtures would be dependent on the availability of the modelled compounds (i.e., cost or complexity of synthetic routes). Therefore, for further validation of our representative models, we intend to carry out molecular dynamics simulations to compare our model with available experimental data of the complete bio-oil.”


Reviewer's comment:
Do you think it would be possible to automate the model selection (PT/FT/AG/SG) depending on the type of mixture (more/less diversity, etc.) to advance in your aim of author-agnostic research?

Response from the authors:

Thank you for the excellent comment and the good idea for the future. This is something that would be possible, and it would improve the automation efficiency. Currently, the output files generated by the code contain information on each model, with a ‘score’ assigned to each model based on its rank against the benchmark, taking into consideration each parameter. Based on this idea suggested by the reviewer, we are planning to explore more dynamic ‘scoring’ functions, such as analytic hierarchy process (AHP) analysis, which could allow for the generation of tailored models by assigning different weights to each parameter (i.e., chemical hardness, molecular weight) – the baseline of this method would consider all parameters equally. However, this will be implemented in a future version of the code.
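As a rough, hypothetical sketch of the kind of weighted scoring we have in mind (the parameter names and weights below are illustrative and not part of the current code, which weighs all parameters equally):

```python
# Hypothetical AHP-style weighted scoring; weights and parameter names are
# illustrative only and not implemented in the current code.
weights = {
    "chemical_hardness": 0.4,
    "molecular_weight": 0.3,
    "dipole_moment": 0.2,
    "polarizability": 0.1,
}

def weighted_model_score(ranks):
    """Combine per-parameter ranks against the benchmark into one score."""
    return sum(weights[param] * rank for param, rank in ranks.items())
```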

Reviewer's comment:
One final minor comment: the figure in page 23 does not display well.

Response from the authors:

We believe it must be an issue with the PNG generation and its integration in the PDF. We have updated the figure in the hope that it will display better now.


REVIEWER #3


Reviewer's comment:
A nice feature of the paper is the use of DFT data (ORCA based) for developing the framework. However, the paper should better explain the methods used, especially on the algorithmic side

(what type of machine learning approach, discuss related/earlier papers in that space, etc.).
The current draft does not provide sufficient details and discussion of related work in the field. If the authors can include that, it would allow readers to better understand the context of the work.

Response from the authors:

We thank the reviewer for this comment, which has helped us to clarify certain points in the manuscript. First, we would like to clarify that our methods facilitate the generation of data and atomistic models for the future training of AI/ML applications. Therefore, following the point made by the reviewer on the type of machine learning approach, we find it important to clarify that we do not train a machine learning model, or implement an existing one, but we generate data for future training. Furthermore, in the manuscript we describe two main approaches for model generation, the “abundancy-based model generation system” and the “molecular classification generation system”. Following the reviewer’s advice on better explaining the methods used, we have expanded the section of the manuscript where our methods are explained. Additionally, Figures 3 and 5 further explain those algorithms. Below we highlight some of these sections from the manuscript explaining our methodologies and the algorithms incorporated.

“Abundancy-based model generation system:

Fig. 3 summarises the procedure for the abundancy-based model generation system, indicating the FT method (pale purple), the PT method (pale green-blue) and its automation using Python code.

Conceptually, the FT method implements a locked selection rule, which is tailored to a specific bio-oil composition and the abundancies of each compound. For the pine bark bio-oil case study, we defined 5% as the fixed abundancy selection criterion. On the other hand, the PT method considers the relative abundancy of each molecule in the mixture, allowing for a dynamic selection rule that is defined for an individual mixture on a case-by-case basis, as described in equation (9). Fig. 4 shows the GCMS data and highlights the molecules that were selected using the FT method (in pale purple) and those selected using the PT method (in pale green-blue) for both the phenolic (Fig. 4a) and wax (Fig. 4b) fractions.

The PT method yields a detailed atomistic model that is a statistical representation of the mixture, whilst the FT aims to select the molecules that constitute the largest proportions in the composition of the bio-oil. Because of its definition, the PT model will typically include all those molecules that were considered in the FT model plus some additional ones unless a mixture is extremely diverse and there were no molecules abundant enough to be selected by a given fixed threshold (e.g., 5%), or if a threshold below the one defined by the PT method was chosen. Fig. 4 shows how, in our case study, all the molecules selected by the FT method are also selected by the PT one, and no “FT-only molecules” are highlighted in the plot.”

“Molecular classification model generation system:

The core concept in the molecular classification system is that each molecular subclass is important in the mixture as each may be key to adequately describing the overall chemical properties, even if they are only present in trace amounts. Accordingly, molecules are classified using a feature identification algorithm based on their chemical structure and atomic composition, followed by either a selection of the most abundant ones from each subclass, i.e., AG method, or a selection based on a scoring function, i.e., SG method. Our feature identification algorithm focuses on general structures and heteroatom compositions of each molecule rather than specific criteria like type of functional groups or characteristics of branching (i.e. degree of branching, presence of π bonds in branches and number of different branches). This approach is deliberately simple and aims to develop a generalized method for any mixture, whilst preventing over-classification which would increase the complexity of resulting models.”

The SG method's scoring function quantifies how a given molecule's molecular properties match the weighted averages of each DFT-calculated molecular descriptor. The final expression for the scoring function is described by equation (10), which compares the weighted averages of the properties with the corresponding values of those properties for a molecule within the subclass under consideration. The molecules with the best score, i.e., closest to 1.0, are then selected to represent that subclass in the final atomistic model.”

We also included a section in the “introduction” with the references that the reviewer suggested in relation to the application and ML and AI, and some additional ones. This section reads now as:

“The exponential growth of machine learning applications and the need for large data sets for training and validation add additional value to this automated generation of models and the performance of DFT calculations. Utilising these atomistic models for machine learning applications can accelerate a bottom-up design and property prediction of complex fluids, an approach that has been successfully used in synthetic biology, catalyst development and even controlled lithium deposition.30–32 Existing transformer models could also be utilised to suggest target molecules during the production of complex fluids, and to predict the properties of candidate molecules.33,34 Other similar areas that also use artificial intelligence to find solutions to complex problems could assist the further development of model generation methodologies and validation methods using adequate training data.35,36 The success of these applications, as well as the capacity of the models developed here to describe real complex systems, are constrained by the accuracy and detection limits of the GCMS/LCMS/HPLC analysis used to generate the raw data. Therefore, the most complete analysis possible that maximises the number of molecules identified would be ideal to provide the most extensive and accurate experimental characterisation of the complex fluid.”


Reviewer's comment:
What exactly is the approach used to build a “molecular classification system that selects molecules after tagging"? Is it some sort of embedding model or simply feature identification (if the latter, this should be clarified as this is a term of art that will be more widely understood)?

Response from the authors:

As the reviewer points out, it is indeed feature identification. We have made slight changes to the explanation of the molecular classification system and have explicitly described the “feature identification” procedure to clarify. The sentence now reads:

“…and (ii) a molecular classification system that selects molecules after grouping them into molecular classes and subsequent subclasses using feature identification dependent on structural features and atomic composition.”

We described this “feature identification” in the second sentence of the molecular classification model generation system section where those methods are explained in more detail. The sentence now reads:

“Accordingly, molecules are classified using a feature identification algorithm based on their chemical structure and atomic composition.”

We also had another sentence referring to the classification of molecules in the “molecular classification model generation system” section and have clarified again that it is “feature identification”. The new sentence now reads:

“Once these heteroatom-containing classes are defined, the method defines subclasses based on chemical structure using feature identification, i.e., presence of rings and different ring sizes (Fig. 5).”


Reviewer’s comment:
What is meant by “author-agnostic computational framework “?

Response from the authors:

By “author-agnostic” we mean a framework that is independent of human biases and individuals’ tastes in the selection of molecules, and is instead algorithm-driven. We have further clarified what we mean by this statement in the introduction by adding more context when we introduce this idea. The paragraph now reads:

“We believe that an author-agnostic platform will reduce human biases as our automated data-driven approach develops atomistic models of bitumen and biobased complex fluids based solely on experimental data and selection algorithms that are independent of the user’s end goals or specific case. We believe that this will boost the performance of data-driven atomistic simulations and allow for atomistic models to be generated via a predefined methodology rather than on a case-by-case basis.”

Reviewer’s comment:
The GitHub link provided in the paper does not work and the reviewer was not able to review the details of the code/algorithm

Response from the authors:

We have addressed this issue together with the answers to the comments from reviewer 1. We include all changes that we made to the github repository below.

A new folder has been created in the github repository called “data”, which contains the data sources needed to test the code. There are three csv files containing the raw data required by the reviewer (each csv file corresponds to each fraction defined in the paper, i.e. complete bio-oil, phenolic fraction, and wax fraction). With these files and the updated version of the code that we provide it will be possible to reproduce the results shown in the manuscript – these files can be found at:
Main_repo → data → …_DFT_results.csv

We have also realised that some key functions were not available in the code. We have included these functions and made small updates to other functions to make the code easier to use. The improvements are explained below.

A new file called “base_functions.py” has been created in the “functions” folder of the repository, which also contains functions not available in the original version of this repository, as follows.

• vol_from_smiles(smiles)
• vol_from_mol(mol)
• has_heteroatoms(mol)
• has_rings(mol)
• get_group_area(list_of_orca_molecule_class_objects)
• calculate_heteroatom_percentages(model)
• get_mixture(model)
• calculate_heteroatom_percentages_sing_mol(mol)
• find_minimum(list_of_orca_molecule_class_objects, attribute)
• min_mols_4_simulation(model, model_type)

Note: As stated above, these functions are present in a new python file: Main_repo → functions → base_functions.py

The functions in base_functions.py were not included in the original repository but are crucial for the generation of our models. We thank again the reviewer for helping us to realise those mistakes.

The other two python files in the “functions” folder have also been slightly updated as follows:

- “class_definitions.py” has an updated function to pull data from the csv files (“csv_to_orca_class(csv_file)”)

- “model_functions.py” has an updated version of the “get_weighted_average(molecules, class_attribute, model_type)” function which now utilises list comprehension.

Note: As stated above, these are edits to functions in the original version of the repository and can be found in: Main_repo → functions → class_definitions.py/model_functions.py

The main scripts that generate the models (i.e., “data_driven_model_generator.py” and “data_driven_model_generator_cmd_line.py”) have been updated to run as ‘main’ functions. Both files are found in the main folder of the repository.

We have also provided the DFT data used for the generation of models within the paper as addressed in the reviewer’s comment 1a – these datafiles (csv format) include data from both GCMS and DFT and contain everything required to reproduce the data reported.
If new data (i.e. GCMS data) from other complex mixtures needs to be tested - and models generated for these mixtures - dictionaries and lists of heteroatoms and structures may need to be updated based on the composition of those mixtures, i.e., if they contain atoms that are not N, C, O or S; or they contain larger ring structures.

The dictionaries for updating heteroatom weights can be found as follows.
• Main_repo → functions → model_functions.py → get_heteroatom_content
• Main_repo → functions → base_functions.py → calculate_heteroatom_percentages

Note: These are specific functions within the given python files.

The lists containing structures and heteroatoms SMARTS pattern for feature identification can be found as follows.
• Main_repo → functions → base_functions.py → has_rings
• Main_repo → functions → base_functions.py → has_heteroatoms

Note: These are specific functions within the given python files.

We only provide DFT data for the bio-oil mixtures produced from pine bark as described in the manuscript and any additional tests with other complex mixtures would require a new set of DFT data for the molecules in those specific mixtures.

The readme.md file now includes instructions to set up a virtual environment as per the reviewer’s feedback; the new contents of the file read:

# Installation
## Installation of a virtual environment

We first create a virtual environment to avoid dealing with other packages installed in the system:

Test folder:
```shell
$ mkdir Test && cd Test
$ git clone https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids.git
```
We create a virtual environment:
```shell
$ python -m venv biobased_venv
```
To activate this environment, we use:
```shell
$ source ./biobased_venv/bin/activate
```
To deactivate the environment:
```shell
$ deactivate
```

Note: the readme file can be found at Main_repo → readme.md

We have also updated requirements.txt, removing python packages that are included by default and adding RDKit as an additional requirement. For the non-default python packages used (i.e., pandas and RDKit) we have included the versions within the file “requirements.txt” as follows:

• rdkit==2022.9.3
• pandas==2.2.1

Note: The requirements.txt file can be found here: Main_repo → requirements.txt

The updated github repository can be checked here:

https://github.com/MMLabCodes/data-driven-representative-models-of-biobased-complex-fluids

We have also updated the zenodo repository, which can be checked here:

https://zenodo.org/doi/10.5281/zenodo.10356309



Reviewer’s comment:
The authors should cite relevant literature more broadly (use of multimodal transformers for chemical, mechanical and other properties; DFT data used in transformer models for solvents etc.), and others.

Response from the authors:

We strongly appreciate the suggestion and additional resources provided by the reviewer. We agree that our introduction focusses on the existing models of biocrude, bio-oils and asphalt, and that it will benefit from discussing other areas of the modelling field that utilise machine learning, artificial intelligence, and bottom-up design.

To amend this, we have added an additional paragraph and several citations (including those provided by the reviewer) in the introduction that conceptualises potential uses of our models within the ML and AI spaces, and we have included some references to ML and LLM models and papers that focus on bottom-up design in different fields. The new paragraph reads:

“The exponential growth of machine learning applications and the need for large data sets for training and validation add additional value to this automated generation of models and the performance of DFT calculations. Utilising these atomistic models for machine learning applications can accelerate a bottom-up design and property prediction of complex fluids, an approach that has been successfully used in synthetic biology, catalyst development and even controlled lithium deposition.30–32 Existing transformer models could also be utilised to suggest target molecules during the production of complex fluids, and to predict the properties of candidate molecules.33,34 Other similar areas that also use artificial intelligence to find solutions to complex problems could assist the further development of model generation methodologies and validation methods using adequate training data.”

The references added are:

30 S. Majumder and A. P. Liu, Phys. Biol., 2017, 15, 013001.
31 Z. Yang, J. Li, Y. Ling, Q. Zhang, X. Yu and W. Cai, ChemCatChem, 2017, 9, 3307–3313.
32 X. Wang, Z. Chen, K. Jiang, M. Chen and S. Passerini, Advanced Energy Materials, n/a, 2304229.
33 R. K. Luu, M. Wysokowski and M. J. Buehler, Applied Physics Letters, 2023, 122, 234103.
34 M. J. Buehler, Journal of the Mechanics and Physics of Solids, 2023, 181, 105454.

Reviewer’s comment:
Can the authors explain a bit more about the speedup obtained over current/related methods of developing coarse-grained force fields ‘manually’?

Response from the authors:

We agree that this could be a bit confusing, and we have tried to clarify it better in the text, following the question from the reviewer. The existing atomistic models developed by chemical intuition have already helped to perform DFT calculations and full-atomistic MD simulations, and even to build coarse-grained models, as we explained in the text. We further clarify this point by editing this section as follows:

“The resulting selection of molecules provided by strategies based on chemical intuition and personal selection yielded a starting point for computational studies that needed atomistic models, i.e., density functional theory (DFT) calculations and full-atomistic MD simulations,21,22,23,24,25,26 and also for those coarse grain models developed from such selection of molecules.27,28,29”

With our approach, we hope to speed up the development of atomistic models that are less dependent on human preferences for DFT, full-atomistic MD, and eventually any coarse-grained model that needs atomistic models as a foundation. Nevertheless, coarse-grained models are not in the scope of the current manuscript, which is focused on generating representative models for full-atomistic simulations. We provided references to coarse-grained models of asphalt, as we have reviewed this area broadly, since it could be impacted by our models.

We hope it is clearer now.

Reviewer’s comment:
There have been other works that use automated interactive workflows, e.g. using agent-based modeling, that should be reviewed

Response from the authors:

We would like to thank the reviewer for providing these references that we were not aware of and pointing out the work by Prof. Markus Buehler at MIT. In our current manuscript, we have not considered the use of agent-based modelling, but we think that it is an excellent idea for future development and improvement of our models. We have included the references suggested by the reviewer in the introduction, and we have suggested how agent-based modelling could be used in conjunction with the work presented here. The sentence that references these articles reads:

“Other similar areas that also use artificial intelligence to find solutions to complex problems could assist the further development of model generation methodologies and validation methods using adequate training data.35,36”

The references added are:
35 B. Ni and M. J. Buehler, Extreme Mechanics Letters, 2024, 67, 102131.
36 A. Ghafarollahi and M. J. Buehler, arXiv, 2024.

Reviewer’s comment:
Figure 3 is important and should perhaps be plotted over the entire page width (like Fig 5 is, for instance) to make the text more legible.

Response from the authors:

We agree the figure is difficult to read and have plotted the figure over both columns to match the size of figure 5.

Reviewer's comment:
Figure 9, radar plots - the caption is not on the same page as the plots (probably an issue that arises from PDF conversion).

Same figure, some additional discussion would be helpful especially if the authors could delineate the tradeoffs better.

Response from the authors:

Regarding the caption for figure 9, we realise it spans across two columns due to the size of the figure. We will discuss with the editorial office when preparing the manuscript for publication, if accepted.

We have added additional discussion to the “comparative performance of data-driven representative models” section to provide a better evaluation of the data presented in the radar plots. The entire section now reads:

“Comparing the abundancy-based methods in fig. 9, the PT model outperforms the FT model overall in all cases because a larger number of molecules are selected with the PT method. Nevertheless, the PT model slightly underestimates the total energy, the polarizability, and the oxygen content and slightly overestimates the dipole moment. In particular, the PT method performs exceptionally well for the complete bio-oil (Fig. 9a). It also matches most weighted average descriptors of the phenolic fraction (fig. 9b), although it clearly underestimates the oxygen content. The proportion of molecules containing two oxygen atoms is 33% lower than it is in the all-molecule model as there are 8 molecules containing multiple oxygen atoms that do not meet the PT model criteria yet account for a combined abundancy of 8.5% within the phenolic fraction. However, it very accurately describes the reactivity, which is especially relevant in this type of material. Compared to the FT model, the performance of the PT model highlights the significance of creating a statistical representation of the bio-oil rather than just selecting the most abundant molecules in each mixture. Overall, the FT method performs best when modelling the phenolic fraction (Fig. 9b) because most molecules in the fraction are phenols (80%), and all molecules selected are from this molecular subclass. Conversely, the FT method performs worst when modelling the wax fraction (Fig. 9c) as the composition is much more diverse and more challenging to be encapsulated with a pre-defined selection rule. Generally, simpler or more homogenous mixtures (i.e., most molecules belonging to the same molecular subclass) are more effectively modelled by the FT method.

Shifting the focus to the molecular classification system, the SG model outperforms the AG model in all cases, although the AG model seems to be better suited for less diverse mixtures where a molecule clearly dominates the composition of a molecular subclass. The consistent performance of the SG model indicates that the use of a scoring function provides a superior method for molecule selection among molecular subclasses, resulting in a better representation of weighted average descriptors. Like the PT models of each fraction (i.e. complete bio-oil, phenolic and wax), the SG one closely matches the weighted average global chemical hardness of the all-molecule benchmark in each case. It is also the most accurate describing molecular weight and oxygen content. This variation between the performance of the SG and AG methods occurs due to the existence of molecular subclasses with several molecules, i.e. the hydrocarbon, fatty acid, and phenolic subclasses. In these cases, different molecules are selected by each of the methods where the most abundant molecule, selected by the AG method, differs from that selected by the SG method. In these cases, the most abundant compound in the subclass does not represent the average descriptors of that subclass.

The AG method performed best when modelling the phenolic fraction (fig. 9b), even though the SG model still performed better. This improved performance of the AG model is attributed to the most abundant molecules in many of the molecular subclasses being those that most closely match the average descriptors of that subclass – in these cases molecular subclasses are small or contain molecules with very similar properties (i.e. fatty acids of differing lengths). Only the molecule selected from the phenolic molecular subclass differs in the models generated for the phenolic fraction using the AG and SG methods.


From the analysis of Fig. 9, there is no clear champion model, but the most suitable method varies depending on the mixture being modelled, hence the development of an automatic platform for model generation and validation. In the case of the complete bio-oil and the phenolic fraction, the PT method performs best when comparing the model’s weighted average descriptors against those of the all-molecule model. Differently, the SG method provides a better model for the wax fraction (Fig. 9c), even though the PT model contains three times as many molecules, highlighting the situations where each excels. The PT model performs better when the bio-oil fraction is dominated by a single molecular subclass (i.e. the characterisation identified the fraction to be comprised of 86.6% phenolic compounds) and a statistical sample of this subclass is optimal. Whereas the SG model triumphs where the bio-oil fraction is not solely dominated by a single molecular subclass, and it is more important to represent each class with the molecule that most closely matches the average descriptors for that class.”

We have also included additional discussion in the “Distribution of chemical reactivity descriptors” section to delineate the discussion and link the sections together in a better way. This section now reads:

“Given the relevance of bio-oils for asphalt rejuvenators or upgrading to biofuels, we found it important to describe the distribution of individual values of global chemical hardness across the sampled molecules in all the representative models, in addition to the average descriptors. Fig. 10 shows histograms with the distribution of values for global chemical hardness in the all-molecule model as well as in each of the representative models. Whilst the distribution of global chemical hardness across models is the most important for our uses, distributions of the other weighted descriptors, i.e., polarizability, dipole moment, and molecular weight, are also calculated and provided in the SI.

As seen in Fig. 10, the FT model fails to represent the distribution of global chemical hardness in all cases (Fig. 10a-c), and unlike the PT method, it does not select enough molecules to reflect the diversity of values in the reactivity descriptor accurately. In contrast, the AG and SG models include a much better distribution of global chemical hardness. For instance, in the case of the phenolic fraction (Fig. 10b), the AG and SG models include an extra peak that corresponds to the hydrocarbon subclass, which is not present in the FT or PT model histograms. This occurs as hydrocarbons are present in very small amounts in the bio-oil and do not meet the selection criteria of the FT and PT models but are included in the AG and SG models by design. A similar observation is seen where the FT and PT models fail to capture the distribution of global chemical hardness in the wax fraction (Fig. 10c), and the peak between 3.5 and 4 eV is absent in both models. This peak corresponds to the phenolic molecular subclass which, again, is present in the AG and SG models by design, highlighting that the omission of phenols by the abundancy-based methods affects the description of the chemical properties of the mixture, particularly the description of their reactivity.”

Finally, we also included additional discussion in the “Model sensitivity analysis of omitted molecular subclasses” section to address the reviewer’s comment. The section now reads:

“Since the omission of phenols by the abundancy-based methods when modelling the wax fraction has been highlighted to be relevant to the description of weighted average descriptors and the weighted descriptors distribution, we implemented a model sensitivity analysis to analyse the effect that omissions of each molecular subclass have on the weighted descriptors distribution, and especially on the global chemical hardness. This helps provide information on which approach (i.e. abundancy-based model generation system or molecular classification model generation system) is most suitable for a given mixture. A breakdown of the molecular subclasses included in each representative model for the complete bio-oil and the phenolic and wax fractions is shown in Table 5. Fig. 11 shows the distribution of global chemical hardness where each different molecular class has been deliberately omitted from the all-molecule model for the complete bio-oil, the phenolic and wax fractions.

The analysis shown in Fig. 11 identifies key molecular subclasses for the distribution of global chemical hardness in the bio-oil and the two main fractions. For example, this distribution in the phenolic fraction is mostly affected by the omission of phenols and fatty acids but only nominally by the omission of benzofurans, furans and hydrocarbons (Fig. 11b), and in fig. 9b the performance of the PT model is not affected by the exclusion of benzofurans and hydrocarbons. In the wax fraction only the omission of phenols affects the distribution (Fig. 11c). This analysis highlights that phenols should be included in any model, and it supports the implementation of a molecular-classification system, which is more suited for modelling the wax fraction.”




Reviewer's comment:
Given that the features seem very specific to the chemistry considered, can the authors discuss how and why they expect it to generalize (especially vs. related methods that are used widely like transformer models to model complex chemistry)?

Response from the authors:

We thank the reviewer for the comment, and we have provided more detail on the intent of our molecular classifier algorithm and how we expect it to generalize. Our feature identification system is based solely on structure and heteroatom content to avoid over-classification and the possibility of misclassification of molecules (i.e., more specific classification using functional groups, degree of branching, and number of different branches). This also means that the classification methods are expected to hold up with different mixtures, but there is scope for expanding these in future versions of the code. Regarding the use of transformer models, we think that these could be useful to classify molecules for our purposes, though to train such a model we would be utilising the same feature identification algorithm as used to classify our molecules. We have included two sentences in the “Molecular classification model generation system” section that read:

“Our feature identification algorithm focuses on general structures and heteroatom compositions of each molecule rather than specific criteria like type of functional groups or characteristics of branching (i.e. degree of branching, presence of π bonds in branches and number of different branches). This approach is deliberately simple and aims to develop a generalized method for any mixture, whilst preventing over-classification which would increase the complexity of resulting models.”

ADDITIONAL CHANGES

We have made some additional changes to the manuscript to improve flow and provide further context in some cases. Sections of the text that contain alterations are in italics, and the changes are noted in these comment boxes in red.

Section: Abstract
Author comment:

We have made a small change to the first sentence of the abstract, changing “biorefinery” to “biorefineries”; the sentence now reads:

“Complex molecular organic fluids such as bitumen, lubricants, crude oil, or biobased oils from biorefineries are intrinsically challenging to model with molecular precision, given the large variety and complexity of organic molecules in their composition.”

Section: Computational details
Author comment:

We have made a small change to the first two sentences of this section, explicitly stating that molecular geometries are optimised rather than saying “All the molecular structures were fully optimized”. These two sentences now read:

“DFT calculations were performed using ORCA 4.2.1 (ref. 38). Geometries of all molecular structures were fully optimised, and the electron density and the associated chemical properties for each molecule were calculated with the PBEh-3c functional.”
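As a minimal sketch of how these calculations can be set up in practice (the file layout and helper names are illustrative assumptions; only the keyword line reflects the PBEh-3c geometry optimisation quoted above):

```python
# Illustrative sketch: generating ORCA 4.2.1 inputs for a PBEh-3c geometry optimisation
# of each molecule in the mixture. File names and directory layout are assumptions;
# only the keyword line reflects the settings quoted in the manuscript.
from pathlib import Path

ORCA_HEADER = "! PBEh-3c Opt\n"  # PBEh-3c composite method, geometry optimisation

def write_orca_input(name, xyz_block, charge=0, multiplicity=1, outdir="orca_inputs"):
    """Write a minimal ORCA input file for one molecule from an XYZ coordinate block."""
    Path(outdir).mkdir(exist_ok=True)
    text = ORCA_HEADER + f"* xyz {charge} {multiplicity}\n{xyz_block}*\n"
    path = Path(outdir) / f"{name}.inp"
    path.write_text(text)
    return path

# Example: a single water molecule as a stand-in for one mixture component.
xyz = "O 0.000 0.000 0.000\nH 0.000 0.757 0.586\nH 0.000 -0.757 0.586\n"
print(write_orca_input("water", xyz))
```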

Section: Representative model development
Author comment:

We have made a small change to the sentence describing the idea of “disproportionate relative abundancies” and have trimmed it to explain the concept more concisely. The new sentence reads:

“For the molecular-classification system, an extra consideration is required, as the selected molecules may vary widely in their relative abundancies in the GCMS data, creating a disproportionate representation of molecules within the models. To mitigate the effect of these disproportionate relative abundancies, a correction is implemented using equation (5), …”
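Equation (5) itself is given in the manuscript and is not reproduced here. Purely to illustrate the kind of correction intended, the sketch below renormalises the GCMS abundances of the molecules selected within each class so that no class is over- or under-represented; this is an illustrative scheme, not a transcription of equation (5):

```python
# Illustrative abundance correction (NOT equation (5) from the manuscript):
# renormalise the weights of the molecules selected to represent each class so that
# every class contributes its total GCMS abundance to the representative model.
def corrected_weights(selected, class_totals):
    """selected: {class: {molecule: raw GCMS abundance}}; class_totals: {class: total abundance}."""
    weights = {}
    for cls, members in selected.items():
        raw_sum = sum(members.values())
        for mol, raw in members.items():
            # each selected molecule keeps its within-class proportion,
            # scaled to the class's overall share of the mixture
            weights[mol] = class_totals[cls] * raw / raw_sum
    return weights

selected = {"phenols": {"cresol": 4.0, "guaiacol": 1.0}, "hydrocarbons": {"heptadecane": 0.5}}
class_totals = {"phenols": 0.60, "hydrocarbons": 0.40}
print(corrected_weights(selected, class_totals))
```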

Section: Representative model development
Author comment:

We have made a small change to the final paragraph of this section to provide context on why we include an analysis in which molecular subclasses are omitted. The sentence now reads:

“We also carried out an analysis of the omission of molecular subclasses from the all-molecule model of each fraction, and of the effect of these omissions on the model’s ability to represent the properties of the complete bio-oil, in order to evaluate the necessity of including each individual molecular subclass.”

Section: Representative models applicability to multi-scale atomistic simulations
Author comment:

We made a small edit in the second paragraph to make our discussion clearer. The edited sentence now reads:

“For example, to represent each possible molecular interaction for molecule A in the FT model constituted by five molecules (FT model of the phenolic fraction), and only considering bimolecular interactions occurring simultaneously, at least six molecules of A would be required to allow for each possible interaction, …”
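The count of six follows from simple pair counting: the A–A interaction requires two copies of A, and each of the four remaining pairs (A–B, A–C, A–D, A–E) requires one additional copy. A minimal sketch of this counting, with placeholder molecule labels, is:

```python
# Count how many copies of molecule A are needed so that every bimolecular
# interaction involving A in a five-molecule model can occur simultaneously.
from itertools import combinations_with_replacement

model = ["A", "B", "C", "D", "E"]          # five molecule types in the FT model
pairs = [p for p in combinations_with_replacement(model, 2) if "A" in p]
copies_of_A = sum(p.count("A") for p in pairs)
print(pairs)          # [('A', 'A'), ('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'E')]
print(copies_of_A)    # 6
```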




Round 2

Revised manuscript submitted on 12 Apr 2024
 

15-Apr-2024

Dear Dr Martin-Martinez:

Manuscript ID: DD-ART-12-2023-000245.R1
TITLE: Data-driven representative models to accelerate scaled-up atomistic simulations of bitumen and biobased complex fluids

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Professor Jason Hein
Associate Editor, Digital Discovery


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************

DISCLAIMER:

This communication is from The Royal Society of Chemistry, a company incorporated in England by Royal Charter (registered number RC000524) and a charity registered in England and Wales (charity number 207890). Registered office: Burlington House, Piccadilly, London W1J 0BA. Telephone: +44 (0) 20 7437 8656.

The content of this communication (including any attachments) is confidential, and may be privileged or contain copyright material. It may not be relied upon or disclosed to any person other than the intended recipient(s) without the consent of The Royal Society of Chemistry. If you are not the intended recipient(s), please (1) notify us immediately by replying to this email, (2) delete all copies from your system, and (3) note that disclosure, distribution, copying or use of this communication is strictly prohibited.

Any advice given by The Royal Society of Chemistry has been carefully formulated but is based on the information available to it. The Royal Society of Chemistry cannot be held responsible for accuracy or completeness of this communication or any attachment. Any views or opinions presented in this email are solely those of the author and do not represent those of The Royal Society of Chemistry. The views expressed in this communication are personal to the sender and unless specifically stated, this e-mail does not constitute any part of an offer or contract. The Royal Society of Chemistry shall not be liable for any resulting damage or loss as a result of the use of this email and/or attachments, or for the consequences of any actions taken on the basis of the information provided. The Royal Society of Chemistry does not warrant that its emails or attachments are Virus-free; The Royal Society of Chemistry has taken reasonable precautions to ensure that no viruses are contained in this email, but does not accept any responsibility once this email has been transmitted. Please rely on your own screening of electronic communication.

More information on The Royal Society of Chemistry can be found on our website: www.rsc.org




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.