From the journal Digital Discovery Peer review history

Element similarity in high-dimensional materials representations

Round 1

Manuscript submitted on 30 Jun 2023
 

24-Jul-2023

Dear Dr Walsh:

Manuscript ID: DD-ART-06-2023-000121
TITLE: Element similarity in high-dimensional materials representations

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This work compares element similarity between commonly used element representations, analyzes the periodic trends, and applies the elemental embeddings to crystal structure prediction as a use case. The authors provide a python package `ElementEmbeddings` to facilitate such analysis. The python package is overall well documented and their GitHub repo contains the code to reproduce most of the results (except for crystal structure prediction) in the paper.

My main concern for this work is not so much on data/code (though there are some issues, as listed below), but on its scientific significance and impact. Few new insights or results are provided in this work, with Figures 1-6 being routine presentations without deep discussion. So what could warrant its publication is the `ElementEmbeddings` package developed by the authors, but this package does not offer much to an ML practitioner either: its functionalities are achieved via simple calls to external python packages such as sklearn and umap, so in practice the coding experience would be almost the same as using those packages directly. Maybe the authors will integrate more functionalities that cannot be achieved easily with off-the-shelf packages, but at least its current implementation is too simple to offer an efficiency boost in practice. As such, I don't find that the current work meets the standard of Digital Discovery.

Below are some detailed comments on data and code:

1. Element embedding data
1a. How were these data collected from external sources? Was any cleaning/preprocessing performed? The description of the data source in the text is too vague, and I cannot find the code showing how these embedding data are extracted/collected. Such code is important to ensure reproducibility and to check data provenance. Another issue is that the code does not include feature labels for the embedding data. The absence of the related code and the feature labels makes it hard for users to inspect and understand the provided data, hence "checklist 3a" is marked as No.
1b. Please clarify in the text whether these embedding data are standardized/scaled during the analysis. This is important as it changes the results such as distances in Fig 1 and 2, and 2D projection in Fig 5 and 6. From the code, I have the impression that unscaled data are used for Fig 1 and 2 whereas scaled data are used for Fig 5 and 6. If so, the authors should explain why unscaled data are used, because typically people standardize features before calculating distance.
1c. The authors only present the results in Figures 1-6, but do not really analyze them. I think at least we should mention why different representations lead to different results. For instance, local representations are independent of training data, whereas distributed representations are learned from training data and are therefore dataset dependent. Therefore, their difference is not only due to the representation scheme, but also related to the training data. This is another reason to clearly describe the data source, as mentioned in 1a.

2. Codes
2a. (related to checklist 5a, 5b, 6a) There is no code for Section D "Application to crystal structure prediction". I also find the description of this section unclear: how is the classification performed? Which ML model (decision tree?) do the authors use? What are the training and test sets? Is what the authors do equivalent to transfer learning?
2b. Minor concern over extensibility: The data availability statement mentions that the package is readily extendable to other representations, but I don't see how users can achieve this. The `Embedding` class is implemented with a `from_json` method, but the format is unclear and no example is provided.
2c. Two minor bugs: (1) If I try with `reducer="umap"` and `n_neighbors=30` in `dimension_plotter`, it throws an error because this parameter is also passed as a kwarg to the sns.scatterplot function. (2) Running `Embedding.load_data('mod_petti').as_dataframe()` throws an error.

Reviewer 2

In the referenced manuscript Onwuli et al. present a study assessing the correlation and similarity between elements using high-dimensional vector representations, and apply this concept to the structure classification of binary compounds. The authors utilize seven proposed approaches to represent element vectors consisting of a set of elemental properties. They calculate different distances between two vectors to measure their similarity, highlighting the validity of cosine similarity. Then, the authors investigate the elemental distribution across the multi-dimensional space after employing dimensionality reduction techniques with PCA and t-SNE. Based on this analysis, they apply the similarity measure to classify the binary solids crystallizing in four typical prototypes, achieving an accuracy of > 70%. This work is very interesting and contributes to the research in the field of material structure prediction. However, there are still many concerns that need to be resolved in the current version, and I would recommend this work for publication if the following comments can be addressed in full:

1. The current description of the methods used in section II.A is not sufficient. Specifically, it is crucial to ascertain whether the authors employed specific elemental structures in generating the representation vectors and to provide information on the original source of these structures. I suggest enhancing the transparency of this work by including information on how the representations were generated by the different methods, along with appropriate references if applicable. Additionally, the authors need to provide the parameters utilized in the different methods employed, such as the cutoff values for identifying the coordinating and neighbouring atoms. These details are essential for readers to gain a clear understanding of the process.
2. On page 2, the authors state that Magpie is a “local” representation. What does this mean? Is Magpie the only local representation among the seven selected representation methods?
3. Are there any basic principles adopted to construct the Random200 representation? The authors mentioned that the Random200 vectors were generated using NumPy. How did the authors determine that these randomly generated vectors can effectively distinguish different elements? It would also be helpful to provide references that have previously applied this kind of method, even if they demonstrated low effectiveness.
4. Did the authors normalize the representation vectors before calculating the distance? If they did, I have a great concern that the similarity measure would be biased due to the potential information loss.
5. I’m very confused about whether the absolute value of the distance should be proportional to the vector’s dimensionality for each selected measure. In Fig. 1, it is observed that the 22-dimensional magpie-based vectors exhibit approximately twice the maximum absolute value compared to the 200-dimensional mat2vec-based vectors, and three times that of the 16-dimensional megnet-based vectors. A similar situation also arises when examining Fig. 2. What is the key factor affecting the absolute value of the distance? And which method of calculating the distance is the most convincing?
6. In Figs. 2-4, I can hardly discern any noticeable differences in the distance mapping between each pair of elements using Random200. It’s confusing why the authors have included the results based on this method, rather than utilizing alternative methods in the main text.
7. On page 3, the authors justify their selection of cosine similarity due to its scale-invariant nature. Yet in Fig. 3, the cosine similarity demonstrates a poorer ability to distinguish different element vectors when used in conjunction with mat2vec compared to other representations. Why does this combination of measure and representation exhibit the highest classification accuracy, as shown in Table II?
8. The data for binary AB solids should be available. At least the Materials Project ID numbers for the selected ground-state structures, as well as the chemical substitutions on each prototype, should be provided.

Reviewer 3

This is an excellent article. The authors have worked out a new analysis for featurization in materials chemistry. The work analyzes a balanced dataset, which is used to perform a classification as an example. Overall, this work is comprehensive and very well done. I put a lot of effort into trying to find issues with the work, but it is a complete and well-written piece of research. I could not even find nitpicking comments. I recommend acceptance as is.


 

Dear Prof. Hippalgaonkar,

We thank the three referees for their supportive, well-thought-out and highly constructive reviews.

We have taken these on board in the revised manuscript, list the main changes below, and hope that you will consider the revised manuscript suitable for publication in Digital Discovery.

The associated codebase has been updated at https://github.com/WMD-group/ElementEmbeddings (v0.4). Apologies for the delay, as the coding checks took some time to complete.

Yours sincerely,
Anthony Onwuli and Aron Walsh (on behalf of all authors)

This text has been copied from the PDF response to reviewers and does not include any figures, images or special characters:

Detailed response below with changes highlighted in the uploaded “highlighted.pdf”

Referee 1
This work compares element similarity between commonly used element representations, analyzes the periodic trends, and applies the elemental embeddings to crystal structure prediction as a use case. The authors provide a python package `ElementEmbeddings` to facilitate such analysis. The python package is overall well documented and their GitHub repo contains the code to reproduce most of the results (except for crystal structure prediction) in the paper. My main concern for this work is not so much on data/code (though there are some issues, as listed below), but on its scientific significance and impact. Few new insights or results are provided in this work, with Figures 1-6 being routine presentations without deep discussion. So what could warrant its publication is the `ElementEmbeddings` package developed by the authors, but this package does not offer much to an ML practitioner either: its functionalities are achieved via simple calls to external python packages such as sklearn and umap, so in practice the coding experience would be almost the same as using those packages directly. Maybe the authors will integrate more functionalities that cannot be achieved easily with off-the-shelf packages, but at least its current implementation is too simple to offer an efficiency boost in practice. As such, I don't find that the current work meets the standard of Digital Discovery.
Author response: We appreciate the positive feedback about the python package that we developed. We have taken the suggestions on board and incorporated them in a new code release (v0.3.1 on GitHub) and into the revised manuscript, as detailed in the responses below. Concerning the impact of the work, there have been many important developments in this community concerning element embeddings and representations; however, the parameter sets are difficult to compare and combine. Since releasing our preprint, several groups have approached us to include their representations in ElementEmbeddings. We see several important use cases: (i) for teaching and training, to easily interact with and visualise high-dimensional element representations; (ii) to reproduce published models that use a given representation; (iii) to develop new models and/or combinations of representations in a streamlined manner. We hope that this will be a platform to enable further functionality, as the reviewer suggests. These applications are very much in the domain and spirit of Digital Discovery.
1. Element embedding data
1a. How were these data collected from external sources? Was any cleaning/preprocessing performed? The description of the data source in the text is too vague, and I cannot find the code showing how these embedding data are extracted/collected. Such code is important to ensure reproducibility and to check data provenance. Another issue is that the code does not include feature labels for the embedding data. The absence of the related code and the feature labels makes it hard for users to inspect and understand the provided data, hence "checklist 3a" is marked as No.
We thank the reviewer for this commentary. In both the docs and the repository for the python package, we have now extended the descriptions of how these data were collected. We have added more detailed descriptions within the text to specify and link the repositories from which we obtained the data. The Embedding class has now been updated in PR#73 to include feature labels when they are present (e.g. Magpie). These are accessible through the `feature_labels` attribute of the `Embedding` class. Additionally, when featurising dataframes that contain formulas, these feature labels are also preserved. Thank you for these nice suggestions.
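For illustration, the updated interface can be used along these lines (a minimal sketch; the import path is assumed from the package documentation and may change between releases):

```python
from elementembeddings.core import Embedding

# Load the Magpie representation that ships with the package
magpie = Embedding.load_data("magpie")

# Feature labels (added in PR#73) name each vector component
print(magpie.feature_labels)

# The labels are carried through to the dataframe view of the embedding
print(magpie.as_dataframe().head())
```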
1b. Please clarify in the text whether these embedding data are standardized/scaled during the analysis. This is important as it changes the results such as distances in Fig 1 and 2, and 2D projection in Fig 5 and 6. From the code, I have the impression that unscaled data are used for Fig 1 and 2 whereas scaled data are used for Fig 5 and 6. If so, the authors should explain why unscaled data are used, because typically people standardize features before calculating distance.
We appreciate the point about clarifying the scaling. All analysis is now carried out on the standardised embedding data, and this has been clarified in the main text. This avoids any confusion for the reader. In making these changes and ensuring consistency throughout, we have updated the performance metrics in the final section.
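For clarity, the standardisation is the usual zero-mean, unit-variance scaling of each vector component across the elements; a minimal sketch using scikit-learn (for illustration only, with a random stand-in matrix rather than a real embedding):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in embedding matrix: rows = elements, columns = vector components
X = np.random.default_rng(0).random((80, 22))

# Zero mean and unit variance for each component across the elements
X_std = StandardScaler().fit_transform(X)

# Distances (Figs. 1-2) and 2D projections (Figs. 5-6) are computed on
# X_std, so no single large-magnitude feature dominates the analysis
```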
1c. The authors only present the results in Figures 1-6, but do not really analyze them. I think at least we should mention why different representations lead to different results. For instance, local representations are independent of training data, whereas distributed representations are learned from training data and are therefore dataset dependent. Therefore, their difference is not only due to the representation scheme, but also related to the training data. This is another reason to clearly describe the data source, as mentioned in 1a.
This is a very good point. In response, we added additional analysis on Page 2 (Section IIA) incorporating clear information about the source of the embeddings and how the training routine affects the distributed, learnt representations.
2. Codes
2a. (related to checklist 5a, 5b, 6a) There is no code for Section D "Application to crystal structure prediction". I also find the description of this section unclear: how is the classification performed? Which ML model (decision tree?) do the authors use? What are the training and test sets? Is what the authors do equivalent to transfer learning?
We have now uploaded the scripts for crystal structure prediction to the Publication folder of the repository and added further clarity to the text about the methodology, with additions on Pages 5 and 6 of the main text and a new Figure 8. In brief, this work uses a structure-substitution approach where the probability of substitution depends on the pairwise cosine similarity. Candidates are ranked by these probabilities; no supervised classification model is required.
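The ranking step can be sketched as follows (a simplified illustration of the approach, not the production code; the function names are ours for this example):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two element vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_substitutions(target, candidates, vectors):
    """Rank candidate elements by cosine similarity to a target element.

    A higher similarity implies a higher substitution probability; no
    supervised classifier is trained at any point.
    """
    scores = {c: cosine_similarity(vectors[target], vectors[c]) for c in candidates}
    return sorted(candidates, key=scores.get, reverse=True)
```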
2b. Minor concern over extensibility: The data availability statement mentions that the package is readily extendable to other representations, but I don't see how users can achieve this. The `Embedding` class is implemented with a `from_json` method, but the format is unclear and no example is provided.
Thank you for highlighting this. We have now added examples demonstrating the extensibility of the package to the ‘Examples’ section of the repository.
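As a plausible minimal example (the authoritative file format is documented in the repository Examples; here we assume `from_json` accepts a path to a JSON mapping of element symbols to vectors):

```python
import json
from elementembeddings.core import Embedding

# Hypothetical three-component custom representation for two elements
custom = {"H": [0.1, 0.5, -0.2], "Li": [0.9, -0.3, 0.4]}

with open("my_embedding.json", "w") as f:
    json.dump(custom, f)

emb = Embedding.from_json("my_embedding.json")
```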
2c. Two minor bugs: (1) If I try with `reducer="umap"` and `n_neighbors=30` in `dimension_plotter`, it throws an error because this parameter is also passed as a kwarg to the sns.scatterplot function. (2) Running `Embedding.load_data('mod_petti').as_dataframe()` throws an error.
(1) Thank you for highlighting this bug in our code. We have fixed it in PR#70. The kwargs argument has been replaced with `reducer_params` and `scatter_params`, which are dictionaries: `reducer_params` is unpacked as kwargs by `UMAP` and `scatter_params` by the `sns.scatterplot` function.
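After this change, the reviewer's example would be written along these lines (a sketch; the import path and remaining arguments are assumed from the package documentation):

```python
from elementembeddings.core import Embedding
from elementembeddings.plotter import dimension_plotter

emb = Embedding.load_data("magpie")

# Reducer options and scatter styling are now passed separately
dimension_plotter(
    emb,
    reducer="umap",
    reducer_params={"n_neighbors": 30},  # unpacked as kwargs by UMAP
    scatter_params={"s": 40},            # unpacked by sns.scatterplot
)
```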
(2) For linear scales such as the modified Pettifor scale or atomic numbers, we have now chosen to represent the elements as one-hot vectors, where the ordering of the components is determined by the chosen scale, e.g. H -> [1,0,0,…,0] with the atomic-number one-hot representation but H -> [0,0,0,…,1] using the modified Pettifor scale. These changes were included in PR#66.
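The construction itself is straightforward; a minimal sketch for an arbitrary element ordering:

```python
import numpy as np

def one_hot(symbol, scale):
    """One-hot vector whose component ordering follows the given scale."""
    vec = np.zeros(len(scale), dtype=int)
    vec[scale.index(symbol)] = 1
    return vec

# Truncated atomic-number ordering for illustration; under the modified
# Pettifor ordering the 1 would move to H's position in that scale
atomic_order = ["H", "He", "Li", "Be", "B"]
print(one_hot("H", atomic_order))  # [1 0 0 0 0]
```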

Referee 2
In the referenced manuscript Onwuli et al. present a study assessing the correlation and similarity between elements using high-dimensional vector representations, and apply this concept to the structure classification of binary compounds. The authors utilize seven proposed approaches to represent element vectors consisting of a set of elemental properties. They calculate different distances between two vectors to measure their similarity, highlighting the validity of cosine similarity. Then, the authors investigate the elemental distribution across the multi-dimensional space after employing dimensionality reduction techniques with PCA and t-SNE. Based on this analysis, they apply the similarity measure to classify the binary solids crystallizing in four typical prototypes, achieving an accuracy of > 70%. This work is very interesting and contributes to the research in the field of material structure prediction. However, there are still many concerns that need to be resolved in the current version, and I would recommend this work for publication if the following comments can be addressed in full:

We thank the reviewer for their careful reading and well-thought-out and constructive suggestions, which have been incorporated into the revised manuscript and are addressed below.
1. The current description of the methods used in section II.A is not sufficient. Specifically, it is crucial to ascertain whether the authors employed specific elemental structures in generating the representation vectors and to provide information on the original source of these structures. I suggest enhancing the transparency of this work by including information on how the representations were generated by the different methods, along with appropriate references if applicable. Additionally, the authors need to provide the parameters utilized in the different methods employed, such as the cutoff values for identifying the coordinating and neighbouring atoms. These details are essential for readers to gain a clear understanding of the process.
We agree and have addressed some of this in response to Reviewer 1. Except for the Random200 representation, all of the elemental representation vectors presented in this work were collected from other sources. Magpie and Oliynyk are collections of elemental properties. The SkipAtom, Megnet16, Matscholar and mat2vec vector representations were generated in various machine learning studies, which we have appropriately cited in the text. For the parameters used in the different methods, we refer to the original papers.
2. On page 2, the authors state that Magpie is a “local” representation. What does this mean? Is Magpie the only local representation among the seven selected representation methods?
We make the distinction in the text that local representations have “vector components with specific meaning” and distributed representations have “vector components learned from training data”. Magpie is defined as a local representation as the vector components each represent different element properties. We have now specified which representations are local or distributed in the main body of the text.
3. Are there any basic principles adopted to construct the Random200 representation? The authors mentioned that the Random200 vectors were generated using NumPy. How did the authors determine that these randomly generated vectors can effectively distinguish different elements? It would also be helpful to provide references that have previously applied this kind of method, even if they demonstrated low effectiveness.
We chose the Random200 representation as a control measure. Typically, random vectors are used as inputs to neural networks, within which a representation of the elements can be learnt, resulting in decent performance for property prediction. A code to produce these vectors is now included in the repository.
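In essence, the generator amounts to the following (a sketch with a truncated element list; the exact distribution and seed used are recorded in the repository script):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility
elements = ["H", "He", "Li", "Be", "B", "C", "N", "O"]  # truncated list

# One 200-dimensional random vector per element, with no built-in
# correlations between chemically similar elements
random200 = {el: rng.standard_normal(200) for el in elements}
```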
4. Did the authors normalize the representation vectors before calculating the distance? If they did, I have a great concern that the similarity measure would be biased due to the potential information loss.
We appreciate this suggestion. As mentioned in response to Reviewer 1, we chose to standardise the representation vectors before calculating distances. We justify this choice as it prevents individual features from biasing our analysis. We have provided further commentary on this choice in the SI, as we acknowledge that it leads to qualitative and quantitative changes in our analysis of element similarity.
5. I’m very confused about whether the absolute value of the distance should be proportional to the vector’s dimensionality for each selected measure. In Fig. 1, it is observed that the 22-dimensional magpie-based vectors exhibit approximately twice the maximum absolute value compared to the 200-dimensional mat2vec-based vectors, and three times that of the 16-dimensional megnet-based vectors. A similar situation also arises when examining Fig. 2. What is the key factor affecting the absolute value of the distance? And which method of calculating the distance is the most convincing?
Thank you for this comment; we agree that it is difficult to visualise in N dimensions. The absolute value of the distance is not necessarily proportional to the dimensionality of the vector: it depends on both the number of components and the spread of their values, which is one motivation for standardising the vectors as described above.
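A quick numerical check illustrates this point (standard-normal components, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Low dimension with a large component spread vs high dimension with a
# unit spread
for dim, spread in [(22, 5.0), (200, 1.0)]:
    a = spread * rng.standard_normal(dim)
    b = spread * rng.standard_normal(dim)
    print(dim, round(float(np.linalg.norm(a - b)), 1))

# Typical result: the 22-dimensional pair is further apart (~33) than the
# 200-dimensional pair (~20), since Euclidean distance grows with the
# component spread as well as the dimensionality
```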
6. In Figs. 2-4, I can hardly discern any noticeable differences in the distance mapping between each pair of elements using Random200. It’s confusing why the authors have included the results based on this method, rather than utilizing alternative methods in the main text.
We chose to use the Random200 embedding as a control measure where there are no correlations between the vectors. We have justified this more clearly on Page 2 (“used here as a control measure”).
7. On page 3, the authors justify their selection of cosine similarity due to its scale-invariant nature. Yet in Fig. 3, the cosine similarity demonstrates a poorer ability to distinguish different element vectors when used in conjunction with mat2vec compared to other representations. Why does this combination of measure and representation exhibit the highest classification accuracy, as shown in Table II?
The visual appearance of the mat2vec heatmap in Figure 3 arises from the scale as the cosine similarities are within [0,1]. It is difficult to connect this directly to the performance in the classification tasks.
8. The data for binary AB solids should be available. At least the Materials Project ID numbers for the selected ground-state structures, as well as the chemical substitutions on each prototype, should be provided.
We fully agree. All of the data has been made available, with scripts to reproduce these results included in the repository under “Publications”, and the SI has been expanded to include the compounds and their Materials Project IDs.

Referee 3
This is an excellent article. The authors have worked out a new analysis for featurization in materials chemistry. The work analyzes a balanced dataset, which is used to perform a classification as an example. Overall, this work is comprehensive and very well done. I put a lot of effort into trying to find issues with the work, but it is a complete and well-written piece of research. I could not even find nitpicking comments. I recommend acceptance as is.

We appreciate the positive feedback from the reviewer and for recommending publication.




Round 2

Revised manuscript submitted on 24 Aug 2023
 

11-Sep-2023

Dear Dr Walsh:

Manuscript ID: DD-ART-06-2023-000121.R1
TITLE: Element similarity in high-dimensional materials representations

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


 
Reviewer 2

I would recommend this work for publication.

Reviewer 1

I appreciate the efforts taken to improve the package and the manuscript. The revision is overall satisfactory and I recommend its publication. Nonetheless, I am still a bit doubtful of the impact of the work, and I would suggest that the authors incorporate the relevant response in the manuscript to highlight its potential impact. For instance, “there have been many important developments in this community concerning element embeddings and representations; however, the parameter sets are difficult to compare and combine” should be included in the introduction. “We see several important use cases… develop new models and/or combinations of representations in a streamlined manner” can be added as prospective use cases in the conclusion.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.