From the journal Digital Discovery Peer review history

Multi-constraint molecular generation using sparsely labelled training data for localized high-concentration electrolyte diluent screening

Round 1

Manuscript submitted on 02 Feb 2023
 

13-Mar-2023

Dear Dr Mailoa:

Manuscript ID: DD-ART-02-2023-000013
TITLE: Multi-Constraint Molecular Generation using Sparsely Labelled Training Data for Localized High-Concentration Electrolyte Diluent Screening

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of the manuscript and reviewers’ comments, I regret to inform you that I do not find your manuscript suitable for publication and therefore it has been rejected in its current form. The reviewers found that your technical or methodological advances needed to be described in more detail and to be more accessible to the reader in the manuscript and/or open source code. In addition, the manuscript did not have benchmarks demonstrating improvement against the original SSVAE model.

However, if you are able to fully address the concerns raised by the reviewers in the reports below, I will consider a substantially rewritten manuscript which takes into account all of the reviewers’ comments. If you choose to resubmit your manuscript, please include a point by point response to the reviewers’ comments and highlight the changes you have made.

Your manuscript will receive a new manuscript ID and submission date and further peer review will be necessary. Please note that re-submitting your manuscript does not guarantee its acceptance in Digital Discovery.

You can re-submit your manuscript using this link, which will remain valid for six months:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(Please note that this link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) and click on "Create a Resubmission" located next to the manuscript number. You will need your case-sensitive USER ID and password to login

I look forward to receiving your re-submission.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The authors in this work have proposed a semi-supervised learning based model for molecule generation that is capable of handling partially missing properties from the prediction property set for each molecule during training. Thus, the proposed model can handle completely labelled, unlabelled and partially labelled molecules during training, so as to create a better conditional generative model. This work will be useful to other researchers working in the area of generative modelling. However, this reviewer feels some additional work is required before it can be published.

The areas where this paper requires additional experiments and explanations are:

Missing Citations:
1. There are a few works on RL for molecule generation, but the authors have mentioned only one paper [Ref 7]. For example: “Deep reinforcement learning for de novo drug design”. There are a few more papers on RL for molecule generation. The authors need to do a more extensive literature review.
2. Major missing citation: “Semi-Supervised Learning with Deep Generative Models” by Kingma et al. is the first foundational work on semi-supervised learning in VAEs. The authors have not cited this paper. Note: Ref 7 in the authors’ paper is based on Kingma’s paper.

Major Issues:
1. Derivations of Equations 1 and 2 need to be shown.
(a) From Equation 1, I can understand that the authors have assumed a normal distribution for the predicted property. This converts Equation 6 from the Kingma et al. paper into Equation 1 in the authors’ paper. This may not be clear to readers. Hence, for Equation 1, the authors need to show the original VAE equation and the final equation after substitution. (This can be shown in the supplementary.)
(b) Equation 2 is based on Equation 7 from the Kingma et al. paper, which requires marginalization over the unobserved variable y. Here, it means integration over the normal distribution of the property y. It would be good if the authors could show a detailed derivation of how they obtain Equation 2 from the original Equation 7 in Kingma’s paper. (This can be shown in the supplementary.) Currently, it is hard to understand how the authors obtain all the terms related to y in Equation 2.
(c) In Equation 3, the authors should clarify why the prediction is mu(y_p) and not y_p. This is due to the normal distribution assumption for y_p, but this has not been mentioned anywhere in the manuscript.
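As a reference point for this request, the two objectives from Kingma et al. that the reviewer refers to can be sketched as follows; the notation here is ours, and the continuous-y form is our assumption of what the substitution would look like, not a quote from either paper:

```latex
% Labelled data (Kingma et al., Eq. 6): an ELBO on log p_\theta(x, y)
-\mathcal{L}(x,y) \;=\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[
    \log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z)
    - \log q_\phi(z \mid x, y) \right]

% Unlabelled data (Kingma et al., Eq. 7): marginalize over the unobserved y.
% With a normal-distribution property predictor
% q_\phi(y \mid x) = \mathcal{N}\!\left(y;\, \mu_\phi(x), \sigma^2_\phi(x)\right),
% the sum over discrete classes becomes an integral over y:
-\mathcal{U}(x) \;=\; \int q_\phi(y \mid x)\,\bigl(-\mathcal{L}(x,y)\bigr)\,\mathrm{d}y
    \;+\; \mathcal{H}\!\left(q_\phi(y \mid x)\right)
```

The reviewer's request is that the manuscript show how these two expressions specialize, term by term, into the paper's Equations 1 and 2 under the normal-distribution assumption for y.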
2. Page 17: Is the linear layer added on top of BERT tunable in both the Encoder and the Predictor, or just the Predictor? If the Encoder weights are always fixed, then what is the purpose of adding this linear layer in the encoder?
3. It might be possible that the features captured using the transfer learning approach are not suitable for the underlying dataset used in this problem. Can the authors use a transformer-based encoder and predictor without any fine-tuning/transfer learning? This way we will have a fair comparison between the RNN and the transformer.
4. For BERT, does the beta value from Equation 3 have any impact on its performance?
5. For the datasets used for training in Tables 2 and 3, can the authors also mention the distribution/percentage of the labelled, unlabelled and partially labelled sets?
6. Lack of baseline comparison: For Tables 2 and 3, can the authors also show the model's performance compared with the original model (SSVAE)? I understand that SSVAE cannot handle a partially labelled set, but we can still train it on the subset of the dataset that is labelled & unlabelled. After training, the authors can then compare its performance with respect to their model. Note: Table 1 only verified whether the authors' proposed model and SSVAE give similar performance on the SSVAE paper's dataset. But the authors have not shown how SSVAE performs on their dataset.
7. Show SSVAE performance in Fig. 4 using the results from point 6 above.

Minor issues:
1. On page 13, regarding the line “For entries where multiple property labels are available from different databases, we choose the available label from the latest database being merged.”: why the latest, and not the average of all sources?
2. On page 14, does removing log(det(C)) from Equation 5 have any impact on the marginalization of y?
3. On page 15, in the line “using the incomplete entries, making these matrices ill-defined”: should this be "incomplete" or "complete"?
4. From page 16: It might be useful to show the impact of constant E and C versus varying E and C on the overall performance of the model, for the dataset used for training in Table 2.
5. From page 18: Can the decoder handle an arbitrary number of conditional variables in a single model? For example, (Mol. Wt, LogP, IE), (LogP, IE), or (Mol. Wt, LogP). Note: One approach here is to have three different models for these sets of conditional variables, but can the authors' model handle all three sets in a single model?

Reviewer 2

This paper proposes an improved method to do conditional generation using both labelled and mixed-labelled property datasets, based on the open-source SSVAE model. However, their technical contribution is trivial: it is just an implementation trick that uses a mask matrix to facilitate the information flow of the Y labels in the original SSVAE framework. They claim to have reimplemented SSVAE with adjustments using PyTorch, which might be a good contribution to the field. However, their code is not open-sourced. The manuscript also did not show sufficient results to justify that multi-property information helps to find better materials. Overall, I see little methodological or technical contribution to the generative design community.
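The mask-matrix mechanism this reviewer describes can be illustrated with a minimal sketch (our own illustration under assumed conventions, not the authors' code): a Boolean mask marks which property labels are observed, and the supervised loss is averaged only over observed entries, so fully labelled, partially labelled, and unlabelled molecules each contribute whatever labels they have.

```python
import numpy as np

def masked_property_loss(y_true, y_pred):
    """Mean squared error over observed property labels only.

    y_true: (n_molecules, n_properties) array, with np.nan marking
            missing labels (partially labelled training data).
    y_pred: (n_molecules, n_properties) model predictions.
    """
    mask = ~np.isnan(y_true)            # True where a label is observed
    if not mask.any():                  # fully unlabelled batch
        return 0.0
    sq_err = (np.where(mask, y_true, 0.0) - y_pred) ** 2
    return float((sq_err * mask).sum() / mask.sum())

# A toy batch mixing fully, partially, and un-labelled molecules:
y_true = np.array([[250.0, 2.5],        # fully labelled
                   [300.0, np.nan],     # partially labelled
                   [np.nan, np.nan]])   # unlabelled
y_pred = np.array([[250.0, 2.5],
                   [310.0, 1.0],
                   [200.0, 3.0]])
loss = masked_property_loss(y_true, y_pred)  # averages over 3 observed labels
```

Only the three observed entries contribute, so the unlabelled molecule adds nothing to the supervised term.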

Several additional comments:

1) The results in Table 2 are not sufficient to justify the advantage of the proposed multi-property SSVAE.
Table 2 does not show how many candidate samples were generated to calculate those means/standard deviations.
The experiment design is not correct here:
to show the advantage of the multi-property SSVAE, you need to train your model and generate e.g. 100,000 samples, and then find that e.g. 15% of the generated samples satisfy the required multi-property constraints.
In comparison, using only the one-property SSVAE, out of 100,000 generated samples, only 10% satisfy this requirement.
Table 2 does not show such results and cannot be used to justify the conclusion of the paper.
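The evaluation protocol the reviewer proposes reduces to an acceptance-rate computation over generated samples; a minimal sketch (the property names, target values, and `rel_tol` criterion are our illustrative assumptions):

```python
def acceptance_rate(samples, targets, rel_tol=0.2):
    """Fraction of generated samples whose every property lies within
    a relative tolerance of its target value.

    samples: list of per-sample property dicts, e.g. {"MolWt": 251.0, ...}
    targets: dict of target property values, e.g. {"MolWt": 250.0}
    """
    def accepted(sample):
        return all(abs(sample[k] - v) <= rel_tol * abs(v)
                   for k, v in targets.items())
    return sum(accepted(s) for s in samples) / len(samples)

targets = {"MolWt": 250.0, "LogP": 2.5}
generated = [{"MolWt": 251.0, "LogP": 2.4},   # within 20% on both targets
             {"MolWt": 400.0, "LogP": 2.5},   # MolWt off by 60%
             {"MolWt": 240.0, "LogP": 3.6}]   # LogP off by 44%
rate = acceptance_rate(generated, targets)     # 1 of 3 samples accepted
```

Comparing this rate between the multi-property and single-property models over the same number of generated samples is exactly the head-to-head evidence the reviewer is asking for.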

2) The case study is not very convincing.
The paper shows some generated samples with good predicted properties, but none of the generated samples are experimentally validated.

To address this limitation, they can do leave-out cross-validation experiments:
first leave out some known, experimentally verified Li-ion battery LHCE diluent molecules, then test how many samples they need to generate to recover those molecules. This validation method does not require direct experimental validation by the authors; instead, the results are validated against test samples that have already been validated experimentally.
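The leave-out recovery test the reviewer suggests amounts to a recall computation over canonicalized structures; a minimal sketch (the SMILES strings are hypothetical examples, and both lists are assumed to be pre-canonicalized, e.g. with RDKit, so that string equality implies molecular identity):

```python
def recovery_rate(generated_smiles, held_out_smiles):
    """Fraction of held-out, experimentally verified molecules that
    appear among the generated samples (recall over the held-out set).
    """
    recovered = set(held_out_smiles) & set(generated_smiles)
    return len(recovered) / len(set(held_out_smiles))

# Hypothetical held-out diluent molecules and generated samples:
held_out = ["FC(F)(F)COCC(F)(F)F", "CCOC(=O)C", "c1ccccc1"]
generated = ["CCO", "FC(F)(F)COCC(F)(F)F", "CCN", "c1ccccc1"]
rate = recovery_rate(generated, held_out)  # 2 of 3 held-out molecules recovered
```

Reporting how many samples must be generated before this rate reaches a given level is the proxy-validation metric the reviewer describes.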

3) I strongly suggest they open-source their code to make a useful contribution to the community, to make up for the limited technical contribution.




Round 2

Revised manuscript submitted on 10 Apr 2023
 

05-Jun-2023

Dear Dr Mailoa:

Manuscript ID: DD-ART-04-2023-000064
TITLE: Multi-Constraint Molecular Generation using Sparsely Labelled Training Data for Localized High-Concentration Electrolyte Diluent Screening

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary. In particular, both reviewers noted that the manuscript must be revised to include a baseline comparison, which demonstrates advances of your method over the existing SSVAE architecture.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript will be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 3

The authors introduce a modification to the SSVAE architecture that enables the use of data that is partially labelled across different label types, and demonstrate its application to Li-ion battery design. The main technical contribution of this paper is that the model (ConGen) improves over SSVAE by being able to utilize a larger dataset with different missing labels, while SSVAE only takes fully labelled or fully unlabelled data as input. I find that the authors did not address the common feedback from both of the previous reviewers, which is that the authors did not sufficiently demonstrate the advantage of using ConGen over SSVAE.

The authors’ response to Reviewer 1’s request for baseline comparisons in Tables 2 and 3 is that there are only 11 molecules with all 5 properties fully labelled, and that it is not meaningful to train an SSVAE model with 11 fully labelled and 372,199 unlabelled molecules. However, the authors state that they have 372,210 Mol.Wt, 310,000 LogP, 310,000 QED, 55,748 EA, and 52,346 IE labels available. This means that the authors can train 5 separate SSVAE models, one for each property, for which the authors do have sufficient data. If the ConGen model performs worse than SSVAE trained separately on each property, this means that ConGen is not able to benefit from training on multiple properties and utilize a larger training set, in which case there would be no technical contribution from this paper.

While the authors made a substantial effort to address the reviewer comments, the main technical contribution of this paper is not sufficiently supported, even after both the reviewers asked for evidence. For this, I am unable to recommend this manuscript for publication.

Reviewer 1

The authors have answered most of my questions, but the manuscript still requires some work before it can be published.
1. In the section “ConGen Advantage on Multi-Condition Generative Design”:
a. For the unique molecules generated in the multi-constraint and single-constraint cases, the authors need to state how many of these unique molecules are not present in the training set. For multi-constraint, it says 14,628 unique molecules are generated, but it does not mention how many of them match molecules from the training data.
b. Are the single-constraint molecules generated using the authors’ proposed ConGen model or SSVAE?

2. Lack of baseline comparison: For the results in Table 2, I understand that due to the extremely low number of labelled examples across all properties, it is not possible to train SSVAE for a baseline comparison. However, for a baseline comparison, can the authors use only properties of the dataset that are easily available (say, using RDKit), and then randomly partition them into labelled/unlabelled or labelled/unlabelled/partially labelled sets for training, to compare SSVAE and ConGen?

Minor Correction:
1. Page 20, line 15: spelling error, “of” is used twice: “If we are interested in generating a chemical space satisfying a number of of”.
2. Page 32: After this sentence, no explanation is provided for why the authors expected such an outcome: “We have expected the transferred BERT-based ConGen to perform worse than the RNN-based ConGen on abundant property label such as Mol.Wt and LogP and better than RNN-based ConGen on rare property label such as IE.”


 

Reviewer #3:

The authors introduce a modification to the SSVAE architecture that enables the use of data that is partially labelled across different label types, and demonstrate its application to Li-ion battery design. The main technical contribution of this paper is that the model (ConGen) improves over SSVAE by being able to utilize a larger dataset with different missing labels, while SSVAE only takes fully labelled or fully unlabelled data as input. I find that the authors did not address the common feedback from both of the previous reviewers, which is that the authors did not sufficiently demonstrate the advantage of using ConGen over SSVAE.

The authors’ response to Reviewer 1’s request for baseline comparisons in Tables 2 and 3 is that there are only 11 molecules with all 5 properties fully labelled, and that it is not meaningful to train an SSVAE model with 11 fully labelled and 372,199 unlabelled molecules. However, the authors state that they have 372,210 Mol.Wt, 310,000 LogP, 310,000 QED, 55,748 EA, and 52,346 IE labels available. This means that the authors can train 5 separate SSVAE models, one for each property, for which the authors do have sufficient data. If the ConGen model performs worse than SSVAE trained separately on each property, this means that ConGen is not able to benefit from training on multiple properties and utilize a larger training set, in which case there would be no technical contribution from this paper.

While the authors made a substantial effort to address the reviewer comments, the main technical contribution of this paper is not sufficiently supported, even after both the reviewers asked for evidence. For this, I am unable to recommend this manuscript for publication.

----------------------------
We thank the reviewer for this comment and important feedback. We have indirectly demonstrated the advantage of multi-property ConGen over single-property ConGen in the previous revision of the manuscript (Supplementary Figure 1), for an example dataset with 3 property constraints (Mol.Wt, LogP, and QED, from one of the two databases used in main text Table 2). By virtue of the model architecture, the single-property ConGen is essentially identical to the single-property SSVAE, and in this case the single-property SSVAE will suffer from the same low acceptance rate problem encountered by the single-property ConGen generation seen in Supplementary Figure 1.

Nevertheless, to make this single-property SSVAE disadvantage immediately obvious to readers, we have performed an additional baseline comparison using three single-property SSVAE models (the Table 2 ConGen generation query only uses 3 property constraints). We train three single-property SSVAE models (Mol.Wt, LogP, and IE) using all molecules in the database, where we have different numbers of labelled and unlabelled molecules for each of the three properties. We then generate 10 molecules for each model's validation and compute all the properties using RDKit (Mol.Wt, LogP, QED) and quantum chemistry (EA, IE). We then compare the properties of the generated molecules, and show that while these single-property SSVAE models can have good control over the corresponding single property they are trained on, they are not constrained on the other properties they are not trained on, making them unsuitable for molecule generation tasks when multi-property control is needed (see Supplementary Table 3). In addition, we have also performed another baseline comparison using multi-property SSVAE, as requested by Reviewer #1, in Supplementary Figure 2.
Main text page 16, line 10. We modified the sentence:

Further discussion on the benefit of multi-constraint conditional generative model over single-constraint conditional generative model, as well as additional comparisons with the baseline SSVAE model, can be found in the Supplementary Information.

Supplementary Information page 12, line 1. We added the section:

ConGen Advantage Compared to SSVAE on Incomplete-Labelled Dataset

In the main text, we have mentioned that the primary advantage of ConGen is that it can work with molecule databases with incomplete labels, which are not suitable for a baseline SSVAE model. Suppose that, for comparison purposes, we would nevertheless like to utilize the SSVAE model for multi-condition generative modelling tasks using training molecule databases with incomplete labels. We can perform this task with two different approaches on the SSVAE model before comparing its performance to a ConGen model trained on the same datasets:

1. Use all molecules with full labels as fully labelled training dataset and designate any molecules with incomplete labels as fully unlabelled training dataset.
2. Train individual SSVAE models with only single-property label each, ensuring that each SSVAE model can utilize all the training property labels.

In the first approach, we allow the SSVAE model to perform multi-constraint molecule generation, in exchange for a significant loss of property training label information. It is impossible to test this approach on the main text's training datasets because of the extremely low availability of fully labelled molecules in our dataset. We have instead utilized the full ZINC database from the main text (containing 310,000 unique molecules) with fully labelled Mol.Wt, LogP, and QED properties (obtained using RDKit). 300,000 of these molecules are designated as the training dataset, while the remaining 10,000 molecules are designated as the test dataset. For each property label column in the training dataset, we randomly de-label 70% of the properties. Consequently, only ~2.7% of the molecules in our training dataset are fully labelled (8,117 fully labelled + 291,883 fully unlabelled molecules). This reflects the severe label sparsity we typically encounter in available experimental databases when multiple properties are needed. We train the SSVAE model of Kang, et al. on this dataset, and perform a query to generate 10,000 molecules in multi-constraint mode (Mol.Wt = 250, LogP = 2.5, and QED = 0.55). Note that the original SSVAE model as published by Kang, et al. only supports single-constraint molecule generation mode, and we have slightly modified the SSVAE model's molecule generation function to enable multi-constraint generation mode in exactly the same way as is done in ConGen. We also train the ConGen model on the same partially de-labelled training dataset (ConGen can utilize all of the ~30% remaining partial labels) and perform a query to generate 10,000 molecules with the same multi-property constraints as specified above. Out of the 10,000 SSVAE molecules, 488 are within the training dataset and we obtain 1,347 unique molecules outside of the training dataset.
On the other hand, ConGen generates 239 molecules which are within the training dataset and 5,988 unique molecules outside of the training dataset. If our tolerance criterion is 20% relative error, we obtain 11 and 1,545 acceptable molecules from the SSVAE and ConGen approaches respectively, corresponding to 0.8% and 25.8% acceptance rates for the two approaches. If our tolerance criterion is a stricter 10% relative error, we obtain only 2 and 295 acceptable molecules from the SSVAE and ConGen approaches respectively, corresponding to 0.1% and 4.9% acceptance rates. See Supplementary Figure 2 below for more details.
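The column-wise de-labelling procedure described above can be sketched as follows (a minimal illustration with placeholder label values, not the authors' code): with three independent property columns and 70% of each column removed, the expected fully labelled fraction is 0.3³ ≈ 2.7%, consistent with 8,117 out of 300,000 molecules.

```python
import numpy as np

rng = np.random.default_rng(0)
n_molecules, n_properties, drop_frac = 300_000, 3, 0.7

# Start from a fully labelled table (placeholder values), then
# independently de-label 70% of the entries in each property column.
labels = rng.normal(size=(n_molecules, n_properties))
for j in range(n_properties):
    drop = rng.random(n_molecules) < drop_frac
    labels[drop, j] = np.nan

# Molecules with no missing entries are the only ones SSVAE can use as
# labelled data; ConGen can additionally use every surviving partial label.
fully_labelled = int((~np.isnan(labels)).all(axis=1).sum())
fraction = fully_labelled / n_molecules   # ≈ 0.3 ** 3 = 0.027
```

This makes the asymmetry concrete: SSVAE sees only the ~2.7% fully labelled rows, while ConGen sees the ~30% of labels remaining in each column.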

(insert Supplementary Figure 2)

In the second approach, we restrict each individual SSVAE model's training to only one property. In this example, we can utilize the training dataset previously used in main text Table 2 and perform a direct comparison with the molecules generated by the ConGen approach. We train three individual SSVAE models, one for each of the three property constraints (Mol.Wt = 250 Da, LogP = 2.5, and IE = 5 eV), and then generate 10 molecules for each SSVAE model's validation using the corresponding single-property constraint. Note that the original SSVAE model as published by Kang, et al. generates programming errors when trained on a single property (the code as published was trained on multiple properties and can only be used for single-property generation tasks), and some minor code modification is needed to enable the model to work on single-property training and molecule generation tasks. The result is shown in Supplementary Table 3, where we show that the individual SSVAE models separately trained on Mol.Wt and IE have good control over the individual property of the molecules they generate, but surprisingly the model separately trained on the single LogP property has poor performance on the LogP molecule generation task (LogP = 2.35 ± 1.12). While the individual SSVAE models have good performance on the single property constraint they were trained on, they are not constrained on the other two properties they are not trained on. Consequently, the molecules generated by these single-constraint SSVAEs are not suitable for satisfying the requirements of multi-constraint generation queries, compared to the ConGen model.

(insert Supplementary Table 3)

----------------------------

Reviewer #1:

The authors have answered most of my questions, but the manuscript still requires some work before it can be published.
1. In the section “ConGen Advantage on Multi-Condition Generative Design”:
a. For the unique molecules generated in the multi-constraint and single-constraint cases, the authors need to state how many of these unique molecules are not present in the training set. For multi-constraint, it says 14,628 unique molecules are generated, but it does not mention how many of them match molecules from the training data.
b. Are the single-constraint molecules generated using the authors’ proposed ConGen model or SSVAE?

---------
We thank the reviewer for this comment.
The unique molecule counts in our manuscript (14,628 from multi-property ConGen and 68,925 from single-property ConGen) already exclude molecules which also exist in the training dataset.

The single-constraint molecules are generated using the ConGen model, which is essentially identical to the SSVAE model architecture for single-property generation. We had to use ConGen for the training because the SSVAE model can only accommodate fully labelled and fully unlabelled molecules. In response to the request from Reviewer #3, we have also added an additional baseline comparison directly utilizing multiple single-property SSVAE models. Please see the reply to point #2 for this additional baseline comparison.


Supplementary Information page 10, line 7. We modified the sentences:
Out of these 90,000 molecules, 4,257 are within the training dataset and we generate a total of 14,628 unique molecules outside of the training dataset. In the single-constraint approach, we query the model 30,000 times for each single constraint. Out of these 90,000 molecules, 3,928 are within the training dataset and we generate a total of 68,925 unique molecules outside of the training dataset.

--------------------
2. Lack of baseline comparison: For the results in Table 2, I understand that due to the extremely low number of labelled examples across all properties, it is not possible to train SSVAE for a baseline comparison. However, for a baseline comparison, can the authors use only properties of the dataset that are easily available (say, using RDKit), and then randomly partition them into labelled/unlabelled or labelled/unlabelled/partially labelled sets for training, to compare SSVAE and ConGen?

-----------------------
We thank the reviewer for this comment and important feedback. We have now added an additional baseline comparison where the baseline SSVAE model is trained on a new database comprising three RDKit properties (Mol.Wt, LogP, and QED). We then corrupt the database (consisting of 310k molecules) by randomly removing 70% of the labels for each property. The ConGen model can utilize all of the remaining 30% of labels, while the SSVAE model can only utilize the 2.7% of molecules that remain fully labelled, because this is the statistical proportion of fully labelled molecules within the training database. There are two non-ideal factors in this new baseline comparison scheme:
1. RDKit properties will be 'easier' to learn because they can be directly computed (Mol.Wt, for example, is just the sum of the atoms' weights), so having access to several hundred thousand training data labels may or may not confer much additional benefit versus having access to several thousand training data labels.
2. The 30% remaining label fraction chosen here may not reflect realistic situations (properties such as IE, EA, and Log.Vis in our databases have around 3% and 0.3% label availability for the work in main text Figure 3). Nevertheless, this 30% proportion is chosen to allow the SSVAE model access to at least several thousand fully labelled molecules in the training database. Otherwise, we would encounter the original problem of an extremely limited number of available fully labelled molecules for SSVAE model training.
We also note that while the original baseline SSVAE code provided by Kang, et al. can be simultaneously trained on multiple properties, it can only be used for single-property molecule generation. We believe it is not meaningful to intentionally remove partial labels from the training data (to make an SSVAE-compatible fully labelled/fully unlabelled partition) only to then run the SSVAE model in single-property generation mode. Hence, we have slightly extended the baseline SSVAE code to enable multi-property generation, just like our multi-property ConGen (the training, however, can only use fully labelled and fully unlabelled molecules, the way the baseline SSVAE model is normally trained). We then provide two baselines: baseline SSVAE running in single-constraint generation mode (following the request from Reviewer #3), and baseline SSVAE running in multi-constraint generation mode.

Supplementary Information page 12, line 1. We added the section:

ConGen Advantage Compared to SSVAE on Incomplete-Labelled Dataset
In the main text, we have mentioned that the primary advantage of ConGen is that it can work with molecule databases with incomplete labels, which is not suitable for a baseline SSVAE model. Suppose we would like to utilize the SSVAE model for multi-condition generative modelling tasks using training molecule databases with incomplete labels regardless, for comparison purposes. We can perform this task with two different approaches on the SSVAE model before comparing its performance to a ConGen model trained on the same datasets:

1. Use all molecules with full labels as fully labelled training dataset and designate any molecules with incomplete labels as fully unlabelled training dataset.
2. Train individual SSVAE models with only single-property label each, ensuring that each SSVAE model can utilize all the training property labels.

In the first approach, we allow the SSVAE model to perform multi-constraint molecule generation, in exchange for a significant loss of property training data label information. It is impossible to test this approach on the main text’s training datasets because of the extremely low availability of fully labelled molecules in our dataset. We have instead utilized the full ZINC database from the main text (containing 310,000 unique molecules) with fully labelled Mol.Wt, LogP, and QED properties (obtained using RDKit). 300,000 of these molecules are designated as the training dataset, while the remaining 10,000 molecules are designated as the test dataset. For each property label column in the training dataset, we randomly de-label 70% of the properties. Consequently, only ~2.7% of the molecules in our training dataset are fully labelled (8,117 fully labelled + 291,883 fully unlabelled molecules). This reflects the severe consequence of randomness we typically encounter from available experimental databases when multiple properties are needed. We train the SSVAE model of Kang, et al on this dataset, and perform a query to generate 10,000 molecules in multi-constraint mode (Mol.Wt = 250, LogP = 2.5, and QED = 0.55). Note that the original SSVAE model as published by Kang, et al only supports single-constraint molecule generation mode, and we have slightly modified the SSVAE model’s molecule generation function to enable multi-constraint generation mode the exact same way it is being done in ConGen. We also train the ConGen model on the same partially de-labelled training dataset (ConGen can utilize all ~70% remaining partial labels) and perform a query to generate 10,000 molecules with same multi-property constraints as specified above. Out of the 10,000 SSVAE molecules, 488 molecules are within the training dataset and we obtain 1,347 unique molecules outside of the training dataset. 
On the other hand, ConGen generates 239 molecules within the training dataset and 5,988 unique molecules outside of it. With a tolerance criterion of 20% relative error, we obtain 11 and 1,545 acceptable molecules from the SSVAE and ConGen approaches respectively, corresponding to acceptance rates of 0.8% and 25.8%. With a stricter tolerance criterion of 10% relative error, we obtain only 2 and 295 acceptable molecules respectively, corresponding to acceptance rates of 0.1% and 4.9%. See Supplementary Figure 2 below for more details.
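For clarity, the random de-labelling procedure and the relative-error acceptance criterion described above can be sketched as follows (a minimal illustration with dummy label values, not the actual ZINC pipeline; all names and numbers in the sketch are ours):

```python
import random

random.seed(0)

# Stand-in for the 300,000-molecule training table: each row holds the
# three property labels (Mol.Wt, LogP, QED); the values here are dummies.
n_mols, n_props = 300_000, 3
labels = [[1.0] * n_props for _ in range(n_mols)]

# Independently de-label 70% of each property column (None = missing label).
for j in range(n_props):
    for i in random.sample(range(n_mols), int(0.7 * n_mols)):
        labels[i][j] = None

# With 30% retention per column, ~0.3**3 = 2.7% of rows stay fully labelled.
fully_labelled = sum(all(v is not None for v in row) for row in labels)
frac = fully_labelled / n_mols

# Relative-error acceptance criterion used to score generated molecules
# against the query anchors (Mol.Wt = 250, LogP = 2.5, QED = 0.55).
targets = (250.0, 2.5, 0.55)

def accepted(props, tol=0.20):
    return all(abs(p - t) / abs(t) <= tol for p, t in zip(props, targets))
```

A generated molecule passes the 20% criterion only if all three predicted properties fall within the relative-error band, which is why the multi-constraint acceptance rates above are much lower than any single-property rate.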

(insert Supplementary Figure 2)

In the second approach, we restrict each individual SSVAE model’s training to only one property. In this example, we can utilize the training dataset previously used in main text Table 2 and perform a direct comparison with the molecules generated by the ConGen approach. We train three individual SSVAE models, one for each of the three property constraints (Mol.Wt = 250 Da, LogP = 2.5, and IE = 5 eV), and then generate 10 molecules from each SSVAE model for validation using the corresponding single-property constraint. Note that the original SSVAE model as published by Kang et al. produces programming errors when trained on a single property (the published code was trained on multiple properties and can only be used for single-property generation tasks), and some minor code modification is needed to enable the model to work on single-property training and molecule generation tasks. The results are shown in Supplementary Table 3, where we show that the individual SSVAE models separately trained on Mol.Wt and IE have good control over the individual property of the molecules they generate, but surprisingly the model trained on the single LogP property performs poorly on the LogP generation task (LogP = 2.35 ± 1.12). While each individual SSVAE model has good performance on the single property constraint it was trained on, it is not constrained on the other two properties it was not trained on. Consequently, the molecules generated by these single-constraint SSVAEs are not suitable for satisfying multi-constraint generation queries, unlike those from the ConGen model.

(insert Supplementary Table 3)

-----------------------
Minor Correction:
1. page 20 line 15 Spelling error of is used twice “If we are interested in generating a chemical space satisfying a number of of “

----------------------
We thank the reviewer for this feedback. The mistake has now been corrected.
Main text page 4, line 15. We fixed the sentence:
If we are interested in generating a chemical space satisfying several of these constraints, many of the molecules found in publicly available databases cannot be used as the fully labelled training data for the SSVAE model.

-----------------------
2. Page 32: After this sentence no explanation is provided why author expected such outcome “We have expected the transferred BERT-based ConGen to perform worse than the RNN-based ConGen on abundant property label such as Mol.Wt and LogP and better than RNN- based ConGen on rare property label such as IE. “

-------------------
We thank the reviewer for this feedback. We have added the follow-up reasoning for this hypothesis.
Main text page 16, line 18. We added the sentence:
We expected this outcome because in general, training a model from scratch is advantageous when enough training data is available while a pre-trained model performs better when there is insufficient training data.




Round 3

Revised manuscript submitted on 06 Jul 2023
 

03-Aug-2023

Dear Dr Mailoa:

Manuscript ID: DD-ART-04-2023-000064.R1
TITLE: Multi-Constraint Molecular Generation using Sparsely Labelled Training Data for Localized High-Concentration Electrolyte Diluent Screening

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 4

The author proposed a new multi-conditional molecules generation model as an improvement of the existing SSVAE, as it could take partially labeled chemistry dataset for training. The new model is then validated in the task of designing molecules for LHCE diluents. This work should be considered as a noticeable improvement of existing methods. However, there are few concerns and minor corrections should be addressed before publication:

1. When comparing the RNN-based and (pre-trained) BERT-based ConGen model, there are two entangled factors: neural network architecture and pre-trained model.

a. For the architecture, attention-based VAE (bert-based ConGen, in this case) is known to perform better than RNN in many ways, except the computational cost. Author does not explain why attention-based architecture is expected to perform worse in this case.
b. The author does a good job in explaining why pre-trained models performed worse in this case. However, the data cannot support this conclusion because RNN-based ConGen vs. pre-trained BERT-based ConGen has significantly different architectures.

Thus I suggest
a. Since the ChemBERTa is pre-trained on molecules that have previously been shown in the literature, its output should have better stability and synthesizability. Taking these dimensions into consideration could make the comparison more complete. For example, consider the Synthetic Accessibility Score or similar metrics.

b. Add the discussion between training-from-scratch BERT-based models to the main text. It is worth mentioning that the training-from-scratch BERT-based model outperformed the RNN-based model in some Conditional Generation tasks, based on the table in supplemental section Impact of ConGen BERT Variations and Training from Scratch.

2. It will make the work more self-consistent if the author could propose a filtering mechanism. Right now, the query generates too many data points without enough stability, synthesizability and novelty. Without a valid filtering method, the author failed to validate those proposed candidates (27838 candidates!) with quantum chemistry methods because of the high computational cost.

3. [Minor correction] Table 4 is a very nice representation of the advantage of the ConGen model. However there are way too many candidates and the author didn’t explain why those are chosen. It would be nice if the author could list the chemical structure of these candidates and only present the most promising ones, instead of all of them. I am confident this will be very helpful for the whole lithium battery community.

4. [Minor correction] In section ‘Baseline SSVAE Model’, please explain how the molecule structure is embedded. I am assuming the model is using SMILES and one-hot encoding but it is not stated in the main text.

5. [Minor correction] In section ‘Baseline SSVAE Model’: ‘where a beam search algorithm is used for converting output to a molecule SMILES’. Please explain how beam search is used here. Per my understanding, converting $x_D$ to SMILES is trivial enough that a beam search is not necessary. The original SSVAE model used beam search, but only for converting the probability to $x_D$ not $x_D$ to SMILES.


 

We thank the editorial team for the decision regarding our manuscript submission. Please find our response below, which has addressed all of the reviewer's latest feedback.

Thank you very much and best regards,
Jonathan
------------------------------------

Reviewer #4:

The author proposed a new multi-conditional molecules generation model as an improvement of the existing SSVAE, as it could take partially labeled chemistry dataset for training. The new model is then validated in the task of designing molecules for LHCE diluents. This work should be considered as a noticeable improvement of existing methods. However, there are few concerns and minor corrections should be addressed before publication:

-------------------
We thank the reviewer for this comment.
-------------------

When comparing the RNN-based and (pre-trained) BERT-based ConGen model, there are two entangled factors: neural network architecture and pre-trained model.
For the architecture, attention-based VAE (bert-based ConGen, in this case) is known to perform better than RNN in many ways, except the computational cost. Author does not explain why attention-based architecture is expected to perform worse in this case.
The author does a good job in explaining why pre-trained models performed worse in this case. However, the data cannot support this conclusion because RNN-based ConGen vs. pre-trained BERT-based ConGen has significantly different architectures.

Thus I suggest
Since the ChemBERTa is pre-trained on molecules that have previously been shown in the literature, its output should have better stability and synthesizability. Taking these dimensions into consideration could make the comparison more complete. For example, consider the Synthetic Accessibility Score or similar metrics.

--------------------------------------
We thank the reviewer for this comment and important feedback. We had indeed neglected to take synthesizability and stability metrics into account in the previous version of the manuscript. This aspect has now been added to the main text.

Main text page 16, line 15. We modified the sentence:

We have expected the transferred BERT-based ConGen to perform worse than the RNN-based ConGen on abundant property label such as Mol.Wt and LogP (simple properties to learn) and better than RNN-based ConGen on rare property label such as IE (complex property to learn).
Main text page 17, line 4. We added the sentence:

We note, however, that the molecules generated by BERT-based ConGen have a slightly better Synthetic Accessibility Score (SA score = 2.42 ± 0.63) compared to those generated by RNN-based ConGen (SA score = 2.52 ± 0.70). This is likely because the ChemBERTa model is pre-trained on molecules which have previously been reported in the literature, making it more likely that the generated molecules are synthesizable.
-------------------------------------

Add the discussion between training-from-scratch BERT-based models to the main text. It is worth mentioning that the training-from-scratch BERT-based model outperformed the RNN-based model in some Conditional Generation tasks, based on the table in supplemental section Impact of ConGen BERT Variations and Training from Scratch.

-------------------------------------
We thank the reviewer for this feedback. We have included a short discussion about training-from-scratch BERT-based ConGen into the main text.
Main text page 16, line 22. We added the sentence:

We note that re-training the BERT-based ConGen from scratch significantly hurts its property prediction performance, although its conditional generation capability on LogP is still better than that of our less-optimized RNN-based ConGen (see Supplementary Information).
------------------------------------------------

It will make the work more self-consistent if the author could propose a filtering mechanism. Right now, the query generates too many data points without enough stability, synthesizability and novelty. Without a valid filtering method, the author failed to validate those proposed candidates (27838 candidates!) with quantum chemistry methods because of the high computational cost.

------------------------------------
We thank the reviewer for this comment and important feedback. We have now added a proposed filtering mechanism. Please see our response to point #3 for further related manuscript edits.

Main text page 26, line 4. We added the sentence:

Because the ConGen model can produce a relatively large number of unique LHCE diluent candidates despite having to satisfy multiple property constraints, it becomes difficult to validate all of them either computationally or experimentally. We propose a filtering mechanism based on synthesizability (only considering molecules with SA score < 3.0) and novelty (multiple clustering queries based on molecular fingerprints to further select 100 molecules, followed by manual selection), and suggest the resulting 35 unique molecules in Table 4 for future investigation as LHCE diluent molecules, out of all the molecules we have generated in Figures 4b, 4c, and 5. A generative-model filtering mechanism is not the focus of our work, and we encourage interested readers to develop their own filtering criteria (including criteria such as flammability, toxicity, etc.) from the full list of candidate molecules, which can be found in the electronically available Supplementary Information.
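A filtering mechanism of this kind can be sketched in a few lines. The sketch below is our own illustration (the candidate names, SA scores, and bit-set fingerprints are hypothetical), combining the SA score < 3.0 cutoff with a greedy maximum-dissimilarity pick as a simple stand-in for the fingerprint-clustering step:

```python
# Greedy diversity filter: keep candidates with SA score < 3.0, then
# repeatedly pick the candidate least similar (Tanimoto on fingerprint
# bit sets) to anything already chosen.
def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def filter_candidates(cands, sa_cutoff=3.0, n_pick=3):
    # cands: list of (name, sa_score, fingerprint-as-set-of-on-bits)
    pool = [c for c in cands if c[1] < sa_cutoff]
    pool.sort(key=lambda c: c[1])  # most synthesizable first
    picked = pool[:1]
    while len(picked) < n_pick:
        best = max(
            (c for c in pool if c not in picked),
            key=lambda c: min(1 - tanimoto(c[2], p[2]) for p in picked),
            default=None,
        )
        if best is None:
            break
        picked.append(best)
    return [c[0] for c in picked]

# Hypothetical candidates (names, SA scores, and fingerprints are made up).
cands = [
    ("mol_a", 2.1, {1, 2, 3}),
    ("mol_b", 2.2, {1, 2, 4}),  # similar to mol_a
    ("mol_c", 2.8, {7, 8, 9}),  # dissimilar to both
    ("mol_d", 3.5, {5, 6}),     # fails the SA cutoff
]
shortlist = filter_candidates(cands, n_pick=2)
```

On this toy input the filter drops `mol_d` (SA score too high) and skips `mol_b` in favour of the more structurally distinct `mol_c`, mirroring the intent of the SA-plus-novelty selection described above.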
-----------------------------------

[Minor correction] Table 4 is a very nice representation of the advantage of the ConGen model. However there are way too many candidates and the author didn’t explain why those are chosen. It would be nice if the author could list the chemical structure of these candidates and only present the most promising ones, instead of all of them. I am confident this will be very helpful for the whole lithium battery community.

--------------------------------------
We thank the reviewer for this comment and important feedback. The Table 4 molecules were not actually ‘chosen’; they are simply all the molecules generated by Query 1 in Figure 4b, for which we have validated all properties except Log.Vis, which we are unable to validate experimentally. However, we believe the reviewer’s suggestion makes sense. We have moved Table 4 into the Supplementary Information. We have also combined all the LHCE diluent candidates produced by our model through the different queries (Figures 4b, 4c, and 5) and added a new Table 4 to the main text containing the most promising candidates, based on novelty and synthesizability (SA score).

Main text page 21, line 16. We modified the sentence:
Nevertheless, we have listed all the molecules that the ConGen model has generated based on their property label input anchors in Supplementary Information for future validation by other research groups with experimental capabilities.

The previous Table 4 has been moved from main text to Supplementary Table 4.

We added a new Table 4 to the main text.

[New Table 4 inclusion in the main text]
---------------------------------------------

[Minor correction] In section ‘Baseline SSVAE Model’, please explain how the molecule structure is embedded. I am assuming the model is using SMILES and one-hot encoding but it is not stated in the main text.

----------------------------------------------
We thank the reviewer for this reminder. We have now included the requested explanation on the input embedding representation.

Main text page 7, line 1. We modified the sentence:

In the SSVAE approach, the molecule input representation (SMILES) is converted into an input embedding x using one-hot encoding. A molecule entry’s training cost function needs to be split into three parts (Equations 1-3).
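As a concrete illustration, one-hot encoding of a SMILES string might look like the following sketch (the character vocabulary and maximum length here are illustrative stand-ins, not the ones used in the manuscript):

```python
# Minimal one-hot encoding of a SMILES string over a fixed character
# vocabulary, padded with a blank token to a maximum length.
VOCAB = [" ", "C", "O", "N", "c", "1", "(", ")", "="]  # " " = padding
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=10):
    padded = smiles.ljust(max_len)  # pad with the blank token
    x = [[0] * len(VOCAB) for _ in range(max_len)]
    for pos, ch in enumerate(padded):
        x[pos][CHAR_TO_IDX[ch]] = 1
    return x  # shape: (max_len, vocab_size)

x = one_hot_smiles("CC(=O)N")  # acetamide
```

Each row of x contains exactly one 1, marking which vocabulary character occupies that position of the (padded) SMILES string.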
-----------------------------------------------

[Minor correction] In section ‘Baseline SSVAE Model’: ‘where a beam search algorithm is used for converting output to a molecule SMILES’. Please explain how beam search is used here. Per my understanding, converting $x_D$ to SMILES is trivial enough that a beam search is not necessary. The original SSVAE model used beam search, but only for converting the probability to $x_D$ not $x_D$ to SMILES.

------------------------------------------------
We thank the reviewer for this correction. We have clarified our mention of the usage of beam search algorithm within the SSVAE model.

Main text page 8, line 5. We modified the sentence:

The beam search algorithm is used to efficiently convert the probabilities into the most likely output sequence x_D (based on a breadth-first tree search mechanism), which is then easily converted to the output molecule SMILES.
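For readers unfamiliar with the mechanism, a minimal beam search over per-step token probabilities can be sketched as follows (the probabilities are toy values of our own; in the SSVAE the decoder supplies the real per-step distributions):

```python
import math

def beam_search(step_probs, beam_width=2):
    """step_probs: list over timesteps of {token: probability} dicts."""
    beams = [("", 0.0)]  # (partial sequence, log-probability)
    for dist in step_probs:
        # Expand every surviving partial sequence by every possible token.
        expanded = [
            (seq + tok, lp + math.log(p))
            for seq, lp in beams
            for tok, p in dist.items()
            if p > 0
        ]
        # Keep only the beam_width most likely partial sequences.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy per-step character distributions for a 3-token sequence.
probs = [
    {"C": 0.6, "O": 0.4},
    {"C": 0.7, "O": 0.3},
    {"O": 0.9, "N": 0.1},
]
best = beam_search(probs, beam_width=2)
```

Keeping a beam of several partial sequences, rather than the single greedy best, lets the search recover sequences whose high-probability tokens come later in the string.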
--------------------------------------------------




Round 4

Revised manuscript submitted on 09 Aug 2023
 

14-Aug-2023

Dear Dr Mailoa:

Manuscript ID: DD-ART-04-2023-000064.R2
TITLE: Multi-Constraint Molecular Generation using Sparsely Labelled Training Data for Localized High-Concentration Electrolyte Diluent Screening

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 4

Author has made notable work in terms of addressing the comments. The introduction of new metrics (synthetic accessibility) made this work more comprehensive. The modification of Table 4 is a step forward from the aspect of presentation.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.