From the journal Digital Discovery: Peer review history

Not as simple as we thought: a rigorous examination of data aggregation in materials informatics

Round 1

Manuscript submitted on 14 Oct 2023
 

07-Nov-2023

Dear Mr Ottomano:

Manuscript ID: DD-ART-10-2023-000207
TITLE: Not as simple as we thought: A rigorous examination of data aggregation in materials informatics

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The authors wrote a nice manuscript that discusses a timely topic for ML: the model-centric vs data-centric approach, and specifically the better-data vs more-data sub-branches. Excellent literature review: short, yet complete, highlighting both DFT and experimental sources. The model-centric approach is praised by scientists with computer science backgrounds. While I understand that these scientists might be skeptical about this manuscript, I fully support Dr. Sparks and his co-authors. I would also add that the parsimony principle should be applied to ML models as well: it is unlikely that a chemical problem requires a complex algorithm, given that nature prefers simple solutions.
1) The authors should discuss how more data and better data approaches sometimes contradict each other. The authors need to emphasize that less data in their case does not mean better data, since the training data selection was randomized.
2) Given that the authors’ conclusion is that sometimes less data is better (which I agree with), do they suggest that researchers should familiarize themselves with all original scientific reports before entering the datapoint manually? What is the reasonable size of the dataset (100, 1000, 5000) that the authors suggest if all data points are to be manually checked before being entered into the database?
3) Related to the previous question, one statement in the manuscript says “If input duplicates are found, we store their median.” Could the authors comment on whether this is the optimal approach? Why not use the most recent report, or the report from a reputable lab?
Overall, this is an excellent manuscript.

Reviewer 2

The article extends recent research on data aggregation in materials informatics, an area where little work has been reported. Various state-of-the-art techniques have been considered. However, there are certain observations which could make the manuscript better for the readers of the journal.
1. The Title of the article could be more formal.
2. Elaborating on the term chemical datasets in “Unlike many other domains, chemical datasets can often be unbalanced, small in size, or collected under diverse experimental conditions” would have made the article more interesting.
3. “As an example, in the context of thermoelectric materials, the introduction of chemistry defects through doping can lead to substantial alterations in electronic properties [Kdasap, 2002, Na et al., 2021].” There is a loss of flow/poor connectivity with the previous and subsequent sentences.
4. “we show that the incorporation of data points focusing on maximizing chemical diversity also leads to a worsening in the performance of such models.” The possible reasons for this, or possible directions for further research, should be included at an appropriate place.
5. “First, we filter out values outside ±15 K of the room temperature, noble gases and radio-isotopes (atomic number (A) > 93).” What is the rationale for filtering out values outside ±15 K of room temperature?
6. “These models include baselines and SOTA for chemical properties prediction given the stoichiometry, with representatives of both classical and Deep Learning (DL) approaches.” Please define the acronym SOTA before it is used further.
7. “This is done, as usual, by training on a subset (80%) of dataset A and computing prediction errors on the corresponding test set (20%). For DL models, 10% of the training size is reserved for a validation set.” What is the rationale for having a different standard for the DL models (10%)? Do the results so generated remain comparable?
8. “Table 2: A green color represents an improvement above one standard deviation with respect to the Baseline setting, yellow indicates equivalent performance (variations could simply be attributed to random fluctuations) and red denotes a worsening above one standard deviation.” Please check that all the data shown in the table are consistent with the color-code scheme described. For instance, in Roost regression, Shear modulus (c) reads 10.5±0.6 vs 8.4±0.2: is that showing an improvement?

Reviewer 3

This paper describes a study showing that mixing datasets together does not necessarily lead to improved performance. Mixing data from different sources is indeed a huge risk but also a potential opportunity, and this paper does not (yet) meet the challenge it addresses. I expected the key findings (e.g., concatenation leads to problems), and I anticipate that many in the community would as well, given that a frequent pushback to building materials databases rhymes with “I don't trust others' data.” While the demonstration of these problems is clear and important, the paper does not rise to a level of impact where I think it is ready for publication. As such, I recommend it be rejected and resubmitted later.

One main area the paper leaves unaddressed is the idea that data from different sources should not be mixed but rather used in a two-step, transfer learning process. There are papers from at least a few years ago (https://arxiv.org/abs/1711.05099) and more recently (https://www.nature.com/articles/s41467-021-26921-5) demonstrating that various flavors of transfer learning are effective routes for mixing datasets. My opinion that transfer learning is the preferred choice for mixing data from obviously different sources (e.g., computation and experiment) is only strengthened by this paper. That is not the research direction proposed in the conclusion, which instead calls for better strategies for concatenation. So, I hesitate to recommend a paper for publication when I draw the opposite conclusion from the one the authors intend.

To be clear, I would like to see the study continue in its direction. There are plenty of interesting leads that the paper’s results and discussion hint at which, if studied further, could make this an impactful paper. My ideas, which I provide not as demands that the author explore them but as examples, include:

- Explaining why concatenation seems to work in some cases. Shear modulus is the only case where mixing data seems to have a positive effect. Is that because computation agrees well with experiment for this property? If so, would I see more benefit from mixing datasets when both are from similar distributions (e.g., different collections of computational or experimental data)?

- Testing whether no data in a region is better than biased data. Figure 4 starts to answer that question, but the analysis is as yet underdeveloped. For example, did the DiSCoVeR algorithm intentionally fill in data in the regions that were removed for the test set?

I would enjoy reading this paper again after the authors have had more time to pursue it further.


 

Dear editor,

We thank you for your kind consideration of our manuscript. As requested, we present our responses to the referees' comments below.

Response to Referee 1

We thank the reviewer for the review and positive feedback. We appreciate the support and the words on the importance of the topic. As for the raised concerns and questions, individual points are addressed below.

1.a. Discuss how more data and better data approaches sometimes contradict each other.

We agree that the two approaches do sometimes contradict each other. Indiscriminately expanding the size of a dataset often reduces control over the quality of the acquired data points. This is crucial in the domain of materials science, given the highly heterogeneous nature of the data under consideration; the introduction of noise or inconsistent data is a considerable risk in such a context.
We thank the reviewer for highlighting this aspect; we have integrated the above considerations into the Introduction.

1.b. Emphasize that less data in their case does not mean better data.

In the context of our experimental analysis, this is only partially true.
It is true that we do not directly investigate the advantages of data efficiency or test on reduced datasets; this has been explored in other relevant related works [1,2]. However, we compare local repositories against aggregated datasets, which are inevitably larger in size, and show that this does not improve accuracy. We trace this back to the difficulty ML methods have in modelling the additional noise and biases that aggregation introduces. Indirectly, this points towards preferring less data when the data are more coherent and controlled (better data). We thank the reviewer for raising this point; we now discuss it more extensively in Sec. 3.1, in the context of both traditional ML methods and the deep learning state of the art.

[1] Li et al., Exploiting redundancy in large materials datasets for efficient machine learning with less data, Nature Communications (2023).
[2] Zhang et al., ET-AL: Entropy-targeted active learning for bias mitigation in materials data, Applied Physics Reviews (2023).

2.a. Do they suggest that researchers should familiarize themselves with all original scientific reports before entering data points manually?

Currently, our research suggests a close collaboration between computer scientists and chemists to ensure that additional data points are collected under circumstances compatible with the original dataset. However, this is extremely time-consuming, and in the last paragraph of our paper (future directions) we stress the need for ML algorithms capable of guiding the aggregation of datasets of chemical properties.

2.b. What is a reasonable size of dataset if all data points are to be manually checked?

It is hard to provide objective recommendations. Clearly, the quantity of acquired training data will depend on the available workforce and resources within a specific laboratory, as well as on the type of chemical property under consideration. For instance, drawing from our experience, we have constructed datasets by manually incorporating on the order of 10^2 material entries, each carefully validated and recommended by a group of 2-3 chemists.

3. Could the authors comment on whether storing the median of duplicates is the optimal approach?

This is indeed related to the previous consideration. Certainly, one could consider a close collaboration with chemists and manually check all duplicate data points; depending on the specific circumstances, this could lead to more coherent datasets.
However, in our paper, we simulate a scenario where practitioners lack knowledge of the specific experimental conditions (or of the reputation of the laboratories), or where the available workforce is not sufficient to manually label each data point.
In such scenarios, we argue that an automated approach is preferable and that storing the median is an acceptable pre-processing step, as it is less biased towards outliers than the mean.
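For concreteness, a minimal sketch of this pre-processing step, assuming the data live in a pandas DataFrame with hypothetical 'formula' and 'target' columns (the actual column names in our pipeline may differ):

```python
import pandas as pd

# Toy records: the same composition reported by different sources.
df = pd.DataFrame({
    "formula": ["Bi2Te3", "Bi2Te3", "PbTe", "SnSe", "SnSe", "SnSe"],
    "target":  [1.1, 1.4, 0.9, 2.2, 2.4, 2.3],
})

# Collapse input duplicates into a single entry by storing their median,
# which is less biased towards outlier reports than the mean.
deduplicated = df.groupby("formula", as_index=False)["target"].median()
print(deduplicated)
```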

Response to Referee 2

We thank the reviewer for the thorough review and feedback. We appreciate the numerous observations, for which we provide individual responses below.

1. More formal title

We kindly disagree with this stylistic choice; we believe there is still room in science for slightly more captivating titles. Moreover, we believe our title is an effective way to communicate the main message of the paper.

2. Clarify the term chemical datasets.

By the term 'chemical dataset', we refer to the different data instances typical of materials informatics, e.g., numerical properties associated with crystal structures or chemical compositions. However, we agree that the term is vague and have replaced it with "datasets of chemical properties".

3. Mentioning doping creates a loss of flow/little connectivity.

We agree with the reviewer and thank them for raising this. In the updated manuscript, we removed the example on doped materials and the subsequent discussion of discontinuous input-target relationships. This is an important issue when dealing with materials data; however, it is only loosely related to data aggregation. Instead, we provide additional clarification on how the typically large value ranges of material properties affect data aggregation.

4. Possible reasons why chemical diversity does not lead to better performance, and possible directions for further research.

Our study shows that chemical diversity is not a good proxy for the performance of ML models in property prediction tasks. This can be attributed to the inherent challenge ML models face in simultaneously fitting diverse data points within a highly heterogeneous ambient space. We have elaborated on these considerations in Section 3.1. Given these challenges, we envision future work focusing on the development of data-aggregation algorithms that learn optimal aggregations of material datasets without relying on prior assumptions, such as improving chemical diversity alone. Our study underscores that such assumptions can prove counterproductive in practical applications. We have included additional insights about future research directions in Section 5 (Conclusions). We thank the reviewer for this comment, which has allowed us to deepen this important aspect of our paper.

5. Rationale behind the temperature cut.

In our study, we apply this cut to remove the dependence of the results on temperature. We preferred this choice over weighing down the discussion with temperature effects, which would deviate from the main focus of the paper. Furthermore, since some of the datasets do not report temperature information, this choice improves consistency across datasets. As for the numerical value, we chose ±15 K as a compromise between preserving most of the data and maintaining consistency within the interval.
Regardless, our main claims are not affected by this value. We have provided additional clarification in the main text (Section 2.1).
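To illustrate, here is a minimal sketch of the filter, assuming a hypothetical tabular schema with a reported temperature, the set of constituent elements, and the largest atomic number in each composition (our actual implementation may differ):

```python
import pandas as pd

ROOM_T = 298.0  # K
NOBLE_GASES = {"He", "Ne", "Ar", "Kr", "Xe", "Rn"}

# Toy rows: one measurement per row.
df = pd.DataFrame({
    "formula":     ["Bi2Te3", "SnSe", "XeF2", "PuO2"],
    "temperature": [300.0, 400.0, 295.0, 298.0],  # K
    "elements":    [{"Bi", "Te"}, {"Sn", "Se"}, {"Xe", "F"}, {"Pu", "O"}],
    "max_Z":       [83, 50, 54, 94],
})

keep = (
    df["temperature"].between(ROOM_T - 15, ROOM_T + 15)      # near room temperature
    & df["elements"].apply(lambda e: not (e & NOBLE_GASES))  # no noble gases
    & (df["max_Z"] <= 93)                                    # no radio-isotopes
)
filtered = df[keep]
print(filtered)  # only Bi2Te3 survives all three cuts
```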

6. Definition of SOTA.

Thank you for spotting this. The acronym SOTA refers to 'state of the art'. We have now defined it at its first occurrence in the text.

7. Rationale for having a different validation standard for the DL models (10%)?

There might be a misunderstanding here. The validation set employed for the deep learning (DL) models (Roost and CrabNet) is not a substitute for the test set: in our paper, traditional ML methods and DL methods are tested on exactly the same set, comprising 20% of the data. What distinguishes traditional ML from DL is the training procedure.
Deep learning models are structurally different and require a third, separate set of data to progressively check that training is proceeding correctly and to prevent overfitting. This is referred to as the 'validation' set.
Consequently, for DL methods, we reserve a small fraction of the training set for this purpose. This results in proportions of 80% (train) / 20% (test) for traditional ML methods, and 70% (train) / 10% (validation) / 20% (test) for DL methods.
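As a sketch of the bookkeeping, assuming scikit-learn and hypothetical feature/target arrays X and y (our actual splitting code may differ):

```python
from sklearn.model_selection import train_test_split

def make_splits(X, y, deep_learning=False, seed=0):
    """80/20 train/test split; for DL models, a further 10% of the full
    dataset is carved out of the training portion as a validation set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    if not deep_learning:
        return (X_train, y_train), (X_test, y_test)            # 80 / 20
    # 10% of the total corresponds to 12.5% of the 80% training portion.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.125, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_test, y_test)      # 70 / 10 / 20
```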

8. Colour scheme in Table 2

Thank you for carefully checking the reported numbers. We have double-checked the numbers reported in the tables, and we are confident that the color scheme is now applied correctly.

Response to Referee 3

We thank the reviewer for the valuable feedback. As for the raised concerns and questions, individual points are addressed below.

1. Impact level

Recent applications of machine learning in materials science have been democratizing access to this field, even for individuals without specialized expertise. We contend that a researcher with a pure computer science background might draw intuitive conclusions based on principles established in more popular domains such as computer vision or natural language processing, where the motto "more data leads to better performance" often holds true. Consequently, we posit that emphasizing important issues directly tied to the unique nature of the data involved can be extremely beneficial for individuals working at the intersection of machine learning and chemistry, even when they possess limited domain knowledge in the latter.

2. Addressing transfer learning

We agree that addressing transfer learning is an important addition that strengthens our paper. For this purpose, we have conducted further experiments and enriched the paper with the corresponding discussion.
In particular, we adopt the fine-tuning approach, as preferred in one of the referenced papers when source and target property coincide (as in our case). We have therefore pre-trained Roost and CrabNet on the source datasets (dataset B in our case) and then transferred the trained weights as initializations for the models trained on dataset A. More details can be found in the 'Transfer learning' paragraph (Sec. 3) of the updated paper. With this aggregation method (reported under the name 'transfer learning' in Tab. 2), we did observe some improvements over our baseline, but only a few sporadic significant ones; the improvements are not consistent across datasets and are most often comparable to the simple concatenation approach.
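The weight-transfer step can be sketched as follows. This is a generic PyTorch illustration with a hypothetical stand-in network and dummy DataLoaders, not the actual Roost/CrabNet API:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for a composition-based model such as Roost or
# CrabNet; the real architectures and training loops live in those packages.
class CompositionNet(nn.Module):
    def __init__(self, n_features=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.body(x)

def fit(model, loader, epochs=50, lr=1e-3):
    """Standard regression training loop (MAE loss)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Dummy featurized datasets standing in for B (source) and A (target).
loader_B = DataLoader(TensorDataset(torch.randn(256, 128), torch.randn(256, 1)), batch_size=32)
loader_A = DataLoader(TensorDataset(torch.randn(128, 128), torch.randn(128, 1)), batch_size=32)

# 1) Pre-train on the source dataset B.
source = fit(CompositionNet(), loader_B)
torch.save(source.state_dict(), "pretrained_on_B.pt")

# 2) Reload the pre-trained weights as the initialization, then fine-tune
#    on the training split of dataset A, as in the baseline setting.
target = CompositionNet()
target.load_state_dict(torch.load("pretrained_on_B.pt"))
target = fit(target, loader_A)
```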

We have expanded the discussion in Sec. 3.1 to include our interpretation of these results, which we briefly summarize below.
We acknowledge that transfer learning proves effective when DL models can be pre-trained on large databases (e.g., of formation energies) and fine-tuned on more specific material properties. However, our task is different, as we require the source and target property to be the same. Moreover, our study presupposes knowledge of the stoichiometry alone, with a correspondingly restricted pool of information. In light of these considerations and our experimental analysis, we posit that the information transferred from one dataset to another through transfer learning is not substantial. Furthermore, transfer learning may face challenges when confronted with values reported under different experimental conditions. Nevertheless, we believe transfer learning remains an interesting approach for investigating the aggregation of information from multiple sources, and we envision a deeper investigation of it in our future work.

While we appreciate the reviewer's valuable suggestion, we believe the core message of our paper remains unchanged.

3. Future directions

We posit that transfer learning and data aggregation can be concurrently pursued. Acknowledging the potential effectiveness of this approach in the context under examination, we have incorporated a more in-depth exploration of it into our future research work.

4.a. Discuss improvements for Shear modulus (c).

The improvement in shear modulus is explained by the shared (DFT-calculated) nature of the two datasets (A and B). Notably, in the case where B is experimental, no improvement is observed.

4.b. Would I see more benefit from mixing datasets when both are from similar distributions?

Our analysis suggests that datasets originating from similar distributions should be prioritized in data-aggregation strategies.

5. Testing whether no data in a region is better than biased data.

The comparison between the proposed aggregation methods and the 'baseline' (corresponding to the 'no data' case, i.e., where no aggregation is performed) partially explores the 'no data versus biased data' question. Concerning the A-A data aggregation case, examined in Section 4.1, the primary motivation is to emulate an acquisition process based on data originating from the same repository, and therefore assumed to be more coherent. Our analysis suggests that introducing new chemistries at early stages of the aggregation process does not yield improvements in ML performance. This observation may be related to the extreme heterogeneity of the underlying chemical space. In its current state, we assert that our paper delivers a correct message, suggesting caution in the aggregation of diverse material datasets. Additionally, it sheds light on counterintuitive outcomes associated with aggregation schemes that prioritize chemical diversity.


We now provide a brief summary of the new uploaded files as requested:

'ToC_entry.docx': table of contents entry (graphical abstract).

'revised_manuscript_highlights.pdf': the revised paper with changes highlighted (in orange) according to the reviewers' comments.

'revised_manuscript_no_highlights.pdf': final version of the paper without highlighting changes.

'high_quality_figures.zip': 600dpi .pdf version of the figures presented in the article.

'tex_source.zip': contains the editable .tex file.

'response_to_referees.pdf': a .pdf formatted version of our response to referees' comments.

We thank you again for your consideration and look forward to your kind response.

Federico Ottomano,
Giovanni De Felice,
Vladimir Gusev,
Taylor Sparks




Round 2

Revised manuscript submitted on 05 Dec 2023
 

22-Dec-2023

Dear Mr Ottomano:

Manuscript ID: DD-ART-10-2023-000207.R1
TITLE: Not as simple as we thought: A rigorous examination of data aggregation in materials informatics

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

The authors addressed all the comments in full.

Reviewer 2

The manuscript appears cohesive.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.