From the journal Digital Discovery

Peer review history

Round 1

Manuscript submitted on 29 Nov 2021

Editor’s decision letter

11-Feb-2022

Dear Dr Schrier:

Manuscript ID: DD-ART-11-2021-000044
TITLE: Predicting compositional changes of organic-inorganic hybrid materials with Augmented CycleGAN

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can log in to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************

Reviewer comments

Reviewer 1

This paper describes a clever use of machine learning to help find new inorganic/organic hybrid materials. The study adopts a type of generative model I knew little about before reading this paper, CycleGANs, and convincingly demonstrates how it can be used for materials discovery. The approach is thoughtfully validated against baselines (e.g., just assuming the inorganic composition does not change with the organic component), and the authors both illustrate the application of the method to their particular target system and explain how it could be used elsewhere. My only concern is that the presentation of this paper could be improved to better ensure that people who are newer to generative models will understand the importance of this work. I describe some suggestions below.
Some understanding of generative machine learning is already needed in the introduction to the paper. Most of the content focuses on the point that “GANs are new and exciting but have not quite been used to predict compositions of materials.” These are important background points, but the impact of the paper on the actual material application, finding new ATMOs, could be brought forward sooner and made clearer to the reader.
The explanation of GANs could also be improved by explaining terms better. For example, explaining that “G_AB” is the generator which takes a composition from Amine A and translates it to a composition for Amine B would go far in helping the reader decipher how CycleGANs work. The explanation was sufficiently clear for me, but I teach courses on the subject of generative models and am likely an easier sell than most readers. A few minor changes would help, such as explaining notation and providing an intuitive explanation of the loss functions (e.g., “the cycle loss provides more training information by ensuring the pair of generators can transform an instance back”).
Also, a point of my curiosity that I would like the paper to address: Do I need a new GAN for each pair of amines? If so, would you suspect there are benefits for transfer learning or “conditioning” the GAN to produce predictions using the structure of the organic phase as inputs?
Beyond these points, I commend the authors for a fascinating paper.

Reviewer 2

I reviewed the data/code aspect of the submission. Generally, the code is well organized which helps a lot for reproducibility. The following are my additional comments.

1) It would help if docstrings are provided for all functions and classes.
2) Currently, the code appears to be scattered into directories, and the user has to be in the directory to use the code. Can the authors make the code a proper package such that one can access it from one namespace? For example, let's say the package name is cacgan, I would prefer to access the analysis module by `cacgan.analysis`. Repos like https://github.com/hackingmaterials/matminer may give you some references. Note that this is just an example, you do not have to cite it.
3) Some example notebooks of the software usage and training would also help.
4) I suggest the authors make proper tests of the code and use continuous integration such as GitHub Actions to do automatic tests.

Reviewer 3

This manuscript by Qianxiang Ai et al. presents the application of Augmented CycleGAN to the prediction of composition changes in organic-inorganic hybrid materials. Its significance is that, using the example of generating realistic compositions for hybrid organic-inorganic crystals, the authors demonstrate the power of Augmented CycleGAN, a method that could have important applications with unpaired data in chemistry and materials science. The work is novel and very interesting, the methodologies are sound, and the results are reliable. The writing is excellent. I suggest publication of this work in the journal Digital Discovery in its current form.

Minor comments:
The authors used the Augmented CycleGAN to find the mappings between the two groups of CNC-templated and NCCN-templated structures, and to generate or predict the compositions of the NCCNs. Is it possible to simply apply a GAN to the single domain of NCCN-templated structures and then generate the compositions?

Author response

Response to the reviewers

We thank the reviewers for their insightful comments. Comments are addressed individually in the following, and the changes made to the manuscript are highlighted in blue (in the .docx document).

RESPONSE TO REFEREE 1

REFEREE 1:
This paper describes a clever use of machine learning to help find new inorganic/organic hybrid materials. The study adopts a type of generative model I knew little about before reading this paper, CycleGANs, and convincingly demonstrates how it can be used for materials discovery. The approach is thoughtfully validated against baselines (e.g., just assuming the inorganic composition does not change with the organic component), and the authors both illustrate the application of the method to their particular target system and explain how it could be used elsewhere. My only concern is that the presentation of this paper could be improved to better ensure that people who are newer to generative models will understand the importance of this work. I describe some suggestions below.
> Authors’ response:
We thank the referee for the positive assessment of our work and appreciate their clear advice for helping us improve our manuscript, which we address below.

COMMENT 1:
Some understanding of generative machine learning is already needed in the introduction to the paper. Most of the content focuses on the point that “GANs are new and exciting but have not quite been used to predict compositions of materials.” These are important background points, but the impact of the paper on the actual material application, finding new ATMOs, could be brought forward sooner and made clearer to the reader.
> Authors’ response to comment 1:
We thank the referee for this suggestion. A new paragraph has been prepended to the first section to prioritize the introduction of ATMO:
Organic-inorganic hybrid crystalline materials are a wide class of functional materials that encompasses halide perovskites,1–3 metal organic frameworks (MOFs),4,5 and templated metal oxides.6 The subclass of amine-templated metal oxides (ATMOs) has been a research focus of structural chemistry due to the intricate interactions between their inorganic building units and amine templates.7–11 The great structural diversity found in ATMOs (exemplified by the amine-templated zinc phosphate structures of four different dimensionalities) can only be matched by their compositional diversity (71 elements, 25 main group building units, and 349 amines as of 2021).12 This immense chemical space, along with various types of possible interactions, makes it extremely challenging to predict the properties of novel ATMOs.

COMMENT 2:
The explanation of GANs could also be improved by explaining terms better. For example, explaining that “G_AB” is the generator which takes a composition from Amine A and translates it to a composition for Amine B would go far in helping the reader decipher how CycleGANs work. The explanation was sufficiently clear for me, but I teach courses on the subject of generative models and am likely an easier sell than most readers. A few minor changes would help, such as explaining notation and providing an intuitive explanation of the loss functions (e.g., “the cycle loss provides more training information by ensuring the pair of generators can transform an instance back”).
> Authors’ response to comment 2:
We thank the referee for this suggestion. A few explanatory/clarifying sentences have been added to section 3, including:
[section 3, 1st paragraph]: G_{AB} takes a composition vector of amine A (C_A) and translates it to a composition vector of amine B (C_B^\prime). A prime is used to denote generated composition vectors throughout this paper. Similarly, G_{BA} translates C_B to C_A^\prime.
[section 3, 2nd paragraph]: The generator G_{AB} is trained to minimize L_{GAN-B}, while D_B is trained to maximize it.
[section 3, 2nd paragraph]: Here, reconstruction means to transform a generated sample using another generator. For example, G_{AB}(G_{BA}(C_B)) is the reconstruction of C_B from G_{BA}(C_B). Minimizing cycle-consistency loss makes the reconstructed sample close to the original sample, which reduces the number of possible mappings produced by the generators.
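
[Illustrative sketch, not part of the original response]: The cycle-consistency idea described above can be shown with a toy example. The code below is only a sketch under stated assumptions: the generators are hypothetical stand-ins (simple permutations of a three-component composition vector) rather than the neural networks used in the manuscript, and the mean absolute difference is assumed as the distance between a sample and its reconstruction. It shows that the loss is zero when G_BA undoes G_AB and positive when it does not.

```python
from typing import Callable, List

Vector = List[float]
Generator = Callable[[Vector], Vector]


def l1_distance(x: Vector, y: Vector) -> float:
    """Mean absolute difference between two composition vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_consistency_loss(c_a: Vector, c_b: Vector,
                           g_ab: Generator, g_ba: Generator) -> float:
    """Distance between each sample and its reconstruction, summed over both domains."""
    recon_a = g_ba(g_ab(c_a))  # A -> B' -> reconstructed A
    recon_b = g_ab(g_ba(c_b))  # B -> A' -> reconstructed B
    return l1_distance(recon_a, c_a) + l1_distance(recon_b, c_b)


# Hypothetical stand-in generators (assumptions, not the authors' models).
def g_ab(c: Vector) -> Vector:
    return c[1:] + c[:1]    # rotate entries left


def g_ba(c: Vector) -> Vector:
    return c[-1:] + c[:-1]  # rotate entries right, which inverts g_ab


def g_ba_bad(c: Vector) -> Vector:
    return list(c)          # identity, which does not invert g_ab


c_a = [0.5, 0.25, 0.25]  # toy composition for amine A
c_b = [0.5, 0.5, 0.0]    # toy composition for amine B

loss_good = cycle_consistency_loss(c_a, c_b, g_ab, g_ba)
loss_bad = cycle_consistency_loss(c_a, c_b, g_ab, g_ba_bad)
print(loss_good, loss_bad)
```

In this toy setting, penalizing the second generator pair pushes training toward pairs of mappings that invert each other, which is the sense in which the loss reduces the number of possible mappings.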

COMMENT 3:
Also, a point of my curiosity that I would like the paper to address: Do I need a new GAN for each pair of amines? If so, would you suspect there are benefits for transfer learning or “conditioning” the GAN to produce predictions using the structure of the organic phase as inputs?
> Authors’ response to comment 3:
As amine identity plays a role in the structure formation of ATMOs, a new augmented CycleGAN should be trained for a different amine (pair). On the other hand, different amines may also behave similarly in crystallization (a trivial example: deuterium-substituted amines), supporting the use of transfer learning. There is an ongoing effort in our group to find such a similarity measure for amines in structural formation.
A predictive/generative model using amine as (part of) the input is certainly desirable. We are, however, discouraged by the imbalanced dataset: while there are 349 different amines in our dataset, the 5 most popular amines account for around 35% of all reported structures, and 243 amines (around 70% of all amines) are considered unpopular (with fewer than 5 structures). Furthermore, the unpopular amines can be chemically very different from the popular ones (e.g. porphyrin). This raises the concern of whether a (conditioned) GAN trained on this dataset can generate samples fairly (i.e., without being biased by the popular amines).
To address these points, we added the following in the main text:
[section 4, 5th paragraph]: As amine identity plays a role in the structure formation of ATMOs, a new Augmented CycleGAN should be trained if a different amine pair is selected. A more general solution for generating ATMO compositions would be a generative model conditioned on both amine identity and chemical system (in contrast to the current model, which is conditioned by the chemical system of input compositions). One challenge is the highly imbalanced ATMO dataset: While there are 349 different amines in our dataset, the 5 amines that appear most frequently account for around 35% of all reported structures, 243 amines (nearly 70% of all amines in the dataset) appear in fewer than 5 structures each, and 159 amines (around 46% of all amines) have only one reported structure. Furthermore, the underrepresented amines (e.g., porphyrin, found in only 8 structures) can be chemically very different from the popular ones (the 5 most frequent amines are short, aliphatic amines). This raises the possible concern that such a general generator model, trained on this severely imbalanced dataset, may not learn from the minority classes, and for this reason we have not studied this more general problem in the current paper.
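
[Illustrative check, not part of the original response]: The amine-level percentages quoted above follow directly from the counts given in the text. The short sketch below only redoes that arithmetic; the variable names are illustrative, and no structure-level counts are assumed because the response does not state them.

```python
# Counts taken from the text above.
n_amines = 349              # distinct amines in the dataset
n_fewer_than_five = 243     # amines appearing in fewer than 5 structures
n_single_structure = 159    # amines with only one reported structure

share_fewer_than_five = 100 * n_fewer_than_five / n_amines
share_single_structure = 100 * n_single_structure / n_amines

print(f"fewer than 5 structures: {share_fewer_than_five:.1f}% of amines")
print(f"exactly 1 structure:     {share_single_structure:.1f}% of amines")
```

The two shares come out at roughly 69.6% and 45.6%, consistent with the "nearly 70%" and "around 46%" stated in the added paragraph.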

RESPONSE TO REFEREE 2
REFEREE 2:
I reviewed the data/code aspect of the submission. Generally, the code is well organized which helps a lot for reproducibility. The following are my additional comments.
> Authors’ response:
We thank the referee for reviewing our data/code and the positive assessment. The comments are addressed below.

COMMENT 1.
It would help if docstrings are provided for all functions and classes.
> Authors’ response to comment 1:
We thank the referee for this suggestion. Docstrings for classes and functions have been provided in the latest version at https://github.com/qai222/CompAugCycleGAN/.

COMMENT 2.
Currently, the code appears to be scattered into directories, and the user has to be in the directory to use the code. Can the authors make the code a proper package such that one can access it from one namespace? For example, let's say the package name is cacgan, I would prefer to access the analysis module by `cacgan.analysis`. Repos like https://github.com/hackingmaterials/matminer may give you some references. Note that this is just an example, you do not have to cite it.
> Authors’ response to comment 2:
We thank the referee for this suggestion and the reference repository. All modules have been moved to the namespace of `cacgan`, and all other scripts, such as dataset generation and training, have been moved to the `scripts` folder.

COMMENT 3.
Some example notebooks of the software usage and training would also help.
> Authors’ response to comment 3:
This is a great suggestion. We have created a notebook illustrating dataset generation and model training which can be accessed at https://github.com/qai222/CompAugCycleGAN/blob/main/scripts/tutorial.ipynb. The following is added to the manuscript:
[Data availability, 1st paragraph]: A notebook illustrating dataset generation and model training is included in the repository at https://github.com/qai222/CompAugCycleGAN/blob/main/scripts/tutorial.ipynb. Testing scripts are placed at https://github.com/qai222/CompAugCycleGAN/tree/main/scripts.

COMMENT 4.
I suggest the authors make proper tests of the code and use continuous integration such as GitHub Actions to do automatic tests.
> Authors’ response to comment 4:
We thank the referee for this suggestion. Test scripts have been added to the repository at https://github.com/qai222/CompAugCycleGAN/tree/main/scripts. The current repository marks a finished project and we do not intend to update any of its functions. In addition, most model training and use relies on GPUs, which are not available on the built-in continuous integration runners GitHub provides (we have not yet explored third-party services such as CircleCI or Azure integration). However, we will make use of continuous integration if we are to extend this method in the future.
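
[Illustrative sketch, not part of the original response]: One common way to reconcile automatic tests with GPU-only code is to skip the GPU-dependent tests when no GPU is detected, so that CPU-only checks still run. The sketch below uses Python's standard unittest module. Everything in it is an assumption for illustration: the `gpu_available` heuristic (looking for `nvidia-smi` on the PATH), the `normalize` helper, and the test names are hypothetical and are not taken from the authors' repository.

```python
import shutil
import unittest
from typing import List


def gpu_available() -> bool:
    """Heuristic assumption: an NVIDIA GPU is usable if nvidia-smi is on PATH."""
    return shutil.which("nvidia-smi") is not None


def normalize(composition: List[float]) -> List[float]:
    """Hypothetical CPU-only helper: scale element amounts so they sum to 1."""
    total = sum(composition)
    return [x / total for x in composition]


class CpuOnlyTests(unittest.TestCase):
    def test_normalize_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([2.0, 1.0, 1.0])), 1.0)


@unittest.skipUnless(gpu_available(), "no GPU on this runner")
class GpuTests(unittest.TestCase):
    def test_training_step(self):
        # Placeholder: a real GPU training test would go here.
        self.skipTest("placeholder for a GPU training test")


def run_suite() -> unittest.TestResult:
    """Collect both test classes and run them; skipped tests do not count as failures."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(CpuOnlyTests))
    suite.addTests(loader.loadTestsFromTestCase(GpuTests))
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

On a runner without a GPU the second class is reported as skipped rather than failed, so the suite can still pass.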

RESPONSE TO REFEREE 3
REFEREE 3:
This manuscript by Qianxiang Ai et al. presents the application of Augmented CycleGAN to the prediction of composition changes in organic-inorganic hybrid materials. Its significance is that, using the example of generating realistic compositions for hybrid organic-inorganic crystals, the authors demonstrate the power of Augmented CycleGAN, a method that could have important applications with unpaired data in chemistry and materials science. The work is novel and very interesting, the methodologies are sound, and the results are reliable. The writing is excellent. I suggest publication of this work in the journal Digital Discovery in its current form.
> Authors’ response:
We thank the referee for highlighting the significance of our study and the positive comment.

COMMENT 1.
Minor comments: The authors used the Augmented CycleGAN to find the mappings between the two groups of CNC-templated and NCCN-templated structures, and to generate or predict the compositions of the NCCNs. Is it possible to simply apply a GAN to the single domain of NCCN-templated structures and then generate the compositions?
> Authors’ response to comment 1:
It is certainly possible to apply a GAN to a single domain. However, such a GAN would be limited by the diversity of datapoints from this domain. Using NCCN-templated structures as an example, it is likely that the trained GAN could reliably generate compositions of popular chemical systems but would give unrealistic compositions for a chemical system that is absent from the NCCN-templated structures.
To stress this point, the following is added to the main text:
[section 2, 2nd paragraph]: Figure 2 also suggests the limitation of training a generator on one structure group (C_A or C_B) only: such a generator would not be able to generate compositions of chemical systems that are absent in the original structure group. Using data from two (or multiple) structure groups, extrapolations can be made to chemical systems that are absent in one particular structure group.

Round 2

Revised manuscript submitted on 22 Feb 2022

Editor’s decision letter

01-Mar-2022

Dear Dr Schrier:

Manuscript ID: DD-ART-11-2021-000044.R1
TITLE: Predicting compositional changes of organic-inorganic hybrid materials with Augmented CycleGAN

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

Thank you for publishing with Digital Discovery, a journal published by the Royal Society of Chemistry – connecting the world of science to advance chemical knowledge for a better future.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

******
******

Please contact the journal at digitaldiscovery@rsc.org

Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.