From the journal Digital Discovery
Peer review history

DiSCoVeR: a materials discovery screening tool for high performance, unique chemical compositions

Round 1

Manuscript submitted on 27 Oct 2021
 

29-Nov-2021

Dear Mr Baird:

Manuscript ID: DD-ART-10-2021-000028
TITLE: DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This work introduces a new tool for multi-objective search of materials with chemical uniqueness (in terms of composition) and high performance (demonstrated via bulk moduli). This is achieved via the combination of multiple methods, ranging from a chemical-intuition-based distance metric and dimensionality reduction to clustering. I appreciate that the authors provide a well-written, publicly available repository for the code and data associated with the work and give detailed instructions and tutorials on how to use them. In contrast to the repo, I think the ideas and methods are not that well presented in the manuscript and the results do not demonstrate the usefulness of the tool. Below I list my concerns and comments.

1. In terms of the methodology, no new techniques are developed in the work and all components of the proposed tool already exist. The novelty I can see is combining them to solve this particular problem, which I think is fine. I am not sure whether the dimension reduction step is necessary. Although classical clustering methods typically do not work well in extremely high-dimensional space, data in materials research is typically not that high dimensional. Then one can directly do clustering in the unreduced space. What is the dimension of the data in this work? It would be great to do some comparative study. Of course, dimension reduction can still be useful for visualization purposes. Also, is there any reason for choosing HDBSCAN* instead of DBSCAN, as in ref. [54]?

2. The manuscript is hard to follow in its current form. It is difficult to get a high-level idea of what the contributions are from the abstract, probably because it is too abstracted. My biggest problem is with the methods. Since this is a tool using many techniques, I think it is extremely important to clarify the role of each technique. This is unfortunately not the case: each method is described, but how they are connected is far from clear. For example, what is the role of CrabNet? (I suspect it is used to generate the material fingerprints on which the dimension reduction and clustering are performed?) What are the true and predicted proxy values in Eq.6 and how are they obtained? Are they related to $p_i$ in Eq.7? Besides, there are various minor but easily avoidable errors (see below).

3. The results do not demonstrate this is a useful tool. The results shown in Figures 2, 3, and 4 demonstrate what the tool can generate, but how to use them to do practical materials discovery is missing. It would be convincing if a practical use case could be demonstrated, even with simulated data. Another question is how to select the hyperparameters. For example, results in Tables 3 and 4 do not have any materials in common and this is dependent on the weights. In practice, how can a user find their values, by cross-validation?

4. Some future work could easily be done that would considerably improve the manuscript. For example, "Because the weights used can have a significant effect on the rankings, it may be worth probing several values for a given study to elucidate and assess behaviors." and "In other words, “predicted” and “true” are identical due to implementation of DiSCoVeR at the time of writing."

Some minor points:

1. The introduction does not seem coherent. Many previous works are listed, but it is not obvious how they are related to this work and why this work is important.

2. Eq.1 is not the normal probability; the denominator part is missing. Also, it is not a tensor product (a tensor product would lead to the increase of order, for example, a tensor product between two vectors results in a second-order tensor). Better call it matrix multiplication or dot product here (same in Eq.3).

3. Page 4: "We split the data into training, validation, and test sets using a 0.8/0.2 train/val ...". no test set, only train/val split.

4. Fig.2: In the figure, some of the unclassified data points are on top of other classified points. Is this because the clustering is done in a higher-dimensional space? Better clarify. Also, better move the legend up to not block the data points.

5. Fig.5: there is a "-- pareto front" legend, but not used in the plot. Also, the figure is named "LOCO-CV results", but that's only for panel b, not a.

6. Page 8, 1st paragraph: "... k-nearest neighbor average (Figure 4a)...", 4a should be 4b. Same in the 3rd paragraph: "whereas in Figure 4a, cluster shapes exhibit similar orientations.".

7. Table 3: E_{pred,kNN} in the caption, but $\rho$ in the table header.

8. Table 4: E_{pred,kNN} in the caption, but $s_{kNN}$ in the table header.

Reviewer 2

The authors present a simple and extremely timely contribution to help the community develop quantitative metrics and approaches to discover truly new materials, rather than materials that are closely related to existing ones. Importantly, the current work defines some metrics to quantify how novel a material is, and reduces to practice a workflow using a set of readily available tools. While I'm certain we could discuss whether particular methods are best-in-class, the point is that this is one example of how a workflow might be put together, as the reduction to practice is difficult in and of itself, and others will no doubt build upon this work.

As mentioned earlier, the authors present a _conceptual_ framework of defining novelty and methods that can incorporate this into any ML-driven searches in the future, which will be valuable to the research community.

Reviewer 3

I was intrigued by the topic of this paper, and the idea of providing a metric for material uniqueness and novelty. The paper features a well-written motivation in the introduction and an abundance of references to recent relevant work, which was highly appreciated.

Unfortunately, I find that the paper is somewhat hard to follow. It is very technically written, and applies several recently developed methodologies that I expect only expert readers in ML-driven materials science will be familiar with. The paper would benefit from a more thorough description of the techniques used, as well as a better explanation of the chemical/physical background. In general, a stronger connection to material properties would improve the paper. For example, the authors could list some of the materials in the different clusters in the DensMAP.

In the presented work, a clustering model for bulk moduli is constructed based on the chemical formula of the material with training data obtained from Materials Project. A few questions arise in that regard:

- Do the clusters obtained from DensMAP make intuitive sense? I.e., do they correspond to somewhat established material classes?
- The bulk modulus is structure dependent – but only the chemical formula is given as input to the model. How much of the variation in bulk modulus within a cluster is due to structure variations?
- On sourcing training data from Materials Project, you write: “The highest bulk modulus is chosen when considering identical formulae.” Wouldn’t it make more sense to choose the most stable material? Would this have implications for the performance of the model?
- Please explain the “train contribution to validation log density” more clearly and why it is a measure of material uniqueness.

In summary, I would encourage the authors to revise the paper to make it more accessible and highlight the main conclusions of the paper more clearly.




 

We thank the reviewers for their constructive feedback, which has led to significant improvements to the manuscript.

RESPONSE TO: REFEREE 1
"1. In terms of the methodology, no new techniques are developed in the work and all components of the proposed tool already exist. The novelty I can see is combing them to solve this particular problem, which I think is fine."
Apart from the novelty metrics (peak proxy and density proxy), we agree that, as mentioned above, the workflow largely consists of combining recent, existing tools. To make this clear, we added a sentence to Section 2 (Methods).

"I am not sure whether the dimension reduction step is necessary. Although classical clustering methods typically do not work well in extremely high-dimensional space, data in materials research is typically not that high dimensional. Then one can directly do clustering in the unreduced space."
This is an important but nuanced topic that is related to https://umap-learn.readthedocs.io/en/latest/clustering.html and https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne. Under certain circumstances, performing dimensionality reduction via UMAP was shown to produce better clustering results than clustering the original high-dimensional data directly with k-means or HDBSCAN, the latter of which likely failed due to the sparsity of data in a high-dimensional (784) space.
In another sense of “use what works”, prior work (Element Mover's Distance, or ElMD) demonstrated that a similar setup produced chemically homogeneous clusters; we adapted and extended this and linked it with other tools to meet the goal of DiSCoVeR. Hargreaves et al. (ref 54) discuss and provide an example that illustrates the utility of the Earth Mover's distance over the Euclidean distance (see Figure 1 of ref 54 and the corresponding discussion, for example). While they don't appear to mention use of ElMD distance matrices directly within a clustering algorithm in the manuscript, correspondence with an author of ElMD revealed that this was implemented offline with minimal success (https://github.com/lrcfmd/ElMD/issues/23).
It should also be noted that if the DensMAP step is removed, the density calculations are unavailable and thus the density proxy scores (one of the main contributions) cannot be evaluated without a suitable alternative. We added a comment to Section 2.1.
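
For illustration, a minimal sketch of how DensMAP exposes the local density information via the public umap-learn API (random placeholder data; the log-density convention at the end is illustrative, not the exact DiSCoVeR implementation):

```python
import numpy as np
import umap  # umap-learn

X = np.random.rand(500, 100)  # placeholder for the high-dimensional inputs

# densmap=True enables density-preserving UMAP; output_dens=True also
# returns the local radii in the original and embedded spaces, from
# which the density (and hence the density proxy) can be derived.
embedding, r_orig, r_emb = umap.UMAP(
    densmap=True, output_dens=True
).fit_transform(X)

# Smaller local radius corresponds to higher local density; one simple
# (illustrative) convention is to use the negative radius as a
# log-density surrogate.
log_density = -r_emb
```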

"What is the dimension of the data in this work?"
Technically, there is no fixed “dimension” of the data directly prior to obtaining the DensMAP embedding, due to the use of ElMD distance matrices rather than feature vectors. However, the inputs to ElMD use the (default) modified Pettifor scale as the elemental “feature scalars” (rather than vectors), which are then encoded into (typically sparse) vectors of size ~100 that represent the compounds. The issue is then that only methods which support either a weighted earth mover's distance directly or a custom distance metric can be used.

"It would be great to do some comparative study. Of course, dimension reduction can still be useful for visualization purposes."
This is certainly of interest. It is possible that a much larger proportion of points will be considered unclassified if the distance matrices are used directly as “precomputed” inputs to HDBSCAN*. Per Reviewer #1's suggestion, we clustered without dimensionality reduction and found that the number of clusters increased from 24 to 44. As might be expected (https://umap-learn.readthedocs.io/en/latest/clustering.html), the percentage of unclassified points increases from 4.8% to 23.2%, highlighting the difficulty of using density-based clustering algorithms with sparse, high-dimensional data. These results have been added to the text in Section 2.1.
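
For illustration, a minimal sketch of this “precomputed” setup using the public ElMD and hdbscan APIs (the toy formula list and min_cluster_size are arbitrary, not the settings behind the reported numbers):

```python
import numpy as np
import hdbscan
from ElMD import ElMD

formulas = ["Al2O3", "Fe2O3", "Cs2KNiF6", "Cs2NaAlH6", "SiC", "TiC"]

# Pairwise Element Mover's Distances (O(n^2); DiSCoVeR uses a faster
# GPU implementation for large datasets).
dm = np.array([[ElMD(a).elmd(b) for b in formulas] for a in formulas])

# Cluster directly on the distance matrix, skipping DensMAP; points
# labeled -1 are left unclassified by HDBSCAN*.
labels = hdbscan.HDBSCAN(
    min_cluster_size=2, metric="precomputed"
).fit_predict(dm)
```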

"Also, any reason why choosing HDBSCAN* instead of DBSCAN as in ref.[54]?"
We noticed that ref.[54] used DBSCAN, but in general we consider HDBSCAN* to be a more sophisticated algorithm (note that the * simply refers to a version which eliminates the stochastic nature of the clustering results). This also more closely follows the clustering tutorial in the UMAP docs which demonstrated success with high-dimensional clustering. To retrospectively put “sophisticated” into clearer terms, “while DBSCAN needs a minimum cluster size and a distance threshold epsilon as user-defined input parameters, HDBSCAN* is basically a DBSCAN implementation for varying epsilon values and therefore only needs the minimum cluster size as single input parameter.” (https://hdbscan.readthedocs.io/en/latest/how_to_use_epsilon.html). Additionally, DBSCAN may be more prone to noise (https://dinhanhthi.com/dbscan-hdbscan-clustering/). Again, these are retrospective and empirically supported comments, but generally support the notion of “better” clustering. However, it may be of interest to compare differences between several clustering algorithms with the caveat that they need to be compatible with the density proxy used in our work. We think it is likely that DBSCAN would produce a reasonable list of “chemically unique” candidates; it is unclear to what extent this would deviate from the HDBSCAN* results.
We added additional commentary related to above in Section 2.1.
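
To make the parameterization difference concrete, a minimal side-by-side sketch on toy 2D data (the eps and min_cluster_size values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan

X = np.random.rand(200, 2)

# DBSCAN requires both a fixed distance threshold (eps) and
# min_samples, and its results are sensitive to eps.
db_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

# HDBSCAN* effectively sweeps over eps values, so the minimum cluster
# size is the only required parameter; cluster_selection_epsilon can
# optionally merge clusters below a distance threshold (Section 3.1).
hdb_labels = hdbscan.HDBSCAN(
    min_cluster_size=10, cluster_selection_epsilon=0.0
).fit_predict(X)
```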

"2. The manuscript is hard to follow in its current form. It is difficult to get a high-level idea of what the contributions are from the abstract, probably because it is too abstracted. My biggest problem is with the methods. Since this is a tool using many techniques, I think it is extremely important to clarify the role of each technique. This is unfortunately not the case: each method is described, but how they are connected is far from clear."
The abstract has been updated to include more details about the methods. Figure 1’s caption was updated to reflect the interplay of various methods and a table was added which summarizes what each method does and how it fits into the DiSCoVeR workflow.

"For example, what is the role of CrabNet? (I suspect it is used to generate the material fingerprints on which the dimension reduction and clustering are performed.)?"
This should now be addressed with the previous edit, but for completeness, the role of CrabNet is simply to predict properties. On the other hand, the material fingerprints that affect the uniqueness rankings are handled internally (and only internally) within ElMD. The only output from ElMD that is used for dimension reduction, clustering, and uniqueness proxies is the distance matrix.
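
For context, the workflow as exposed by the mat_discover package looks roughly like the following (paraphrased from the repository README; exact signatures and required DataFrame columns may differ between versions, and the property values are placeholders):

```python
import pandas as pd
from mat_discover.mat_discover_ import Discover

# Toy training/validation data: formula plus target property (bulk modulus).
train_df = pd.DataFrame(
    {"formula": ["Al2O3", "SiC", "Fe2O3"], "target": [228.0, 220.0, 160.0]}
)
val_df = pd.DataFrame({"formula": ["Cs2KNiF6", "TiC"], "target": [0.0, 0.0]})

disc = Discover()
disc.fit(train_df)             # CrabNet learns the formula -> property mapping
scores = disc.predict(val_df)  # combines predictions with uniqueness proxies
```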

"What are the true and predicted proxy values in Eq.6 and how are they obtained?"
These are the same and are obtained directly via DensMAP. Some commentary existed on this (“In the current implementation, however, the chemical uniqueness proxy is determined a-priori and simultaneously using the full dataset; thus, the error contribution from the chemical uniqueness proxy is zero.”, Section 2.3); however, it lacked reference to the variables from Eq. 6, perhaps making it easy to overlook. We added another sentence with references to the variables in Eq. 6 to make it clearer.

"Are they related to $p_i$ in Eq.7?"
Yes, except that the $p_i$ in Eq.7 are scaled versions of the proxy. In Eq.7, we changed $E_i$ and $p_i$ to $E_{scaled,i}$ and $p_{scaled,i}$, respectively, to differentiate.
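
As an illustration of how the scaled quantities combine, a minimal sketch assuming a weighted-sum reading of Eq. 7 (the z-score scaling and weight names are our shorthand here, not necessarily the exact scaling used in the manuscript):

```python
import numpy as np

def weighted_score(E, p, w_E=1.0, w_p=1.0):
    """Combine predicted performance E with a uniqueness proxy p
    after scaling each to zero mean and unit variance."""
    E_scaled = (E - E.mean()) / E.std()
    p_scaled = (p - p.mean()) / p.std()
    return w_E * E_scaled + w_p * p_scaled

# toy usage with random performance and proxy values
scores = weighted_score(np.random.rand(100), np.random.rand(100), w_p=2.0)
```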

"Besides, there are various minor but easily avoidable errors (see below)."
These have been corrected.

"3. The results do not demonstrate this is a useful tool."
The difficulty here largely relates to the subjectivity of chemical uniqueness. We added a paragraph to Section 3 elaborating on this.
Also, the abstract states, “We demonstrate that DiSCoVeR can successfully screen materials for both performance and uniqueness in order to extrapolate to new chemical spaces” which is under the assumption that ElMD is an “intuitive” chemical dissimilarity metric as supported by ref.[54]. In some sense, this is obvious – all that is needed is a metric for performance, a metric for uniqueness, a screening protocol, and an implementation that links these together; however, to our knowledge this is the first time this has been done in an automated fashion with explicit emphasis on chemical uniqueness. To avoid confusion, we have rephrased this sentence in the abstract.
Finally, we added a comparison with random search against several parameter combinations of the performance/proxy weightings. While random search is relatively naïve, it was a straightforward baseline to implement and provides more context for the reader to decide on the suitability of the tool for a given application.
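
For concreteness, the kind of baseline meant here (a generic random-search sketch, not the exact code used in the comparison study):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search_best(true_values, n_iter=100):
    """Pick candidates uniformly at random without replacement and
    track the best (highest) true value found after each pick."""
    picks = rng.choice(len(true_values), size=n_iter, replace=False)
    return np.maximum.accumulate(true_values[picks])

# e.g., compare this trace against the best-found values per DiSCoVeR iteration
best_so_far = random_search_best(np.random.rand(1000))
```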

"The results shown in Figures 2, 3, and 4 demonstrate what the tool can generate, but how to use them to do practical materials discovery is missing."
We agree that a description of how to do practical materials discovery using the results from the tool is missing. A subsection was added to Results and Discussion along with a table summarizing the process.

"It would be convincing if a practical use case can be demonstrated, even by simulated data."
Commentary on this was added to Section 2.2. We're working on using the tool in experimental efforts, but it's more likely this will appear in a follow-up study with more emphasis placed on the practical considerations of using the tool (varying hyperparameters and assessing based on intuition/domain knowledge, literature searches, etc.).
However, we have added an adaptive design validation study that we believe addresses this feedback (see (3.) above).

"Another question is how to select the hyperparameters? For example, results in tables 3 and 4 do not have any materials in common and this is dependent on the weights. In practice, how can a user find their values, by cross validation?"
Commentary added to Section 2.3.

"4. Some future work can easily be done, with which the manuscript can be improved a lot. For example, 'Because the weights used can have a significant effect on the rankings, it may be worth probing several values for a given study to elucidate and assess behaviors.' "
This has been addressed with the addition of the random search comparison study (see (3.) from above).

"and 'In other words, “predicted” and “true” are identical due to implementation of DiSCoVeR at the time of writing..' "
This is a rather involved implementation. We developed a fast GPU-based earth mover's distance function to use in conjunction with ElMD (a significant effort in itself); however, part of the reason this is fast is that the distance computations are spread over many GPU cores. On the other hand, supplying a custom distance metric to UMAP can be much slower. This likely wouldn't be a problem except that we can already handle distance matrices on the order of ~50,000 x 50,000 on consumer hardware. To have an efficient implementation for much larger datasets (e.g. 1e6), we will likely need to inject parallelization into pynndescent, which is used by UMAP, which in turn is used by DiSCoVeR; if this is still too slow, then we may need to consider changing some of the algorithms within pynndescent or switching to a different algorithm entirely. We have some discussion open with UMAP's primary developer on how to integrate these changes, but it will likely take a while. See https://github.com/lmcinnes/pynndescent/issues/136.
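
To illustrate why a custom metric is the bottleneck: UMAP accepts a user-supplied distance function, but it must be numba-compilable and is evaluated one pair at a time on CPU cores, losing the batched-GPU advantage of a precomputed ElMD matrix. A minimal sketch (the Manhattan distance here is a toy stand-in for a chemistry-aware metric):

```python
import numba
import numpy as np
import umap

@numba.njit()
def manhattan(x, y):
    # Toy stand-in for a custom elemental distance; UMAP/pynndescent
    # call this pairwise on the CPU.
    total = 0.0
    for i in range(x.shape[0]):
        total += abs(x[i] - y[i])
    return total

X = np.random.rand(300, 100)
embedding = umap.UMAP(metric=manhattan).fit_transform(X)
```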

"Some minor points:"

"1. The introduction seems not coherent. Many previous works are listed, but it is not obvious how they are related to this work and why this work is important."
The introduction has been modified (in large part by rearranging) in order to improve the logical flow and motivation.

"2. Eq.1 is not the normal probability; the denominator part is missing. Also, it is not a tensor product (a tensor product would lead to the increase of order, for example, a tensor product between two vectors results in a second-order tensor). Better call it matrix multiplication or dot product here (same in Eq.3)."
This has been changed to “where the probability is proportional to” (Eq. 1). We also changed “tensor product” to “matrix multiplication”.

"3. Page 4: "We split the data into training, validation, and test sets using a 0.8/0.2 train/val ...". no test set, only train/val split."
Reference to “test sets” has been removed.

"4. Fig.2: In the figure, some of the unclassified data points are on top of other classified points. Is this because the clustering is done in a higher-dimensional space? Better clarify. Also, better move the legend up to not block the data points."
This likely arises from the use of a density-based clustering algorithm rather than a manifold partitioning clustering algorithm. In other words, the density is low, but appears high when visualized with thousands of points in a 2D plot due to "low magnification". This has been added to Section 3.1.
This may also be related to the distance threshold for merging clusters (cluster_selection_epsilon) which is another hyperparameter in the HDBSCAN* algorithm. We added a comment about this parameter in Section 3.1.
The following figure (not included in the manuscript) shows an overlay, albeit not with cluster label colors, of smaller points on the density map: https://sparks-baird.github.io/mat_discover/figures/dens-targ-scatter.png

"5. Fig.5: there is a "-- pareto front" legend, but not used in the plot. Also, the figure is named "LOCO-CV results", but that's only for panel b, not a."
During the most recent run/version of the data (i.e. the one included in the paper), a plot with a single point as the “Pareto front” was produced. After rerunning the data for the revision, a multi-point Pareto front was produced.

"6. Page 8, 1st paragraph: "... k-nearest neighbor average (Figure 4a)...", 4a should be 4b. Same in the 3rd paragraph: "whereas in Figure 4a, cluster shapes exhibit similar orientations."."
Corrected.

"7. Table 3: E_{pred,kNN} in the caption, but $\rho$ in the table header."
Corrected (should be $\rho$ in both).

"8. Table 4: E_{pred,kNN} in the caption, but $s_{kNN}$ in the table header."
Corrected (should be $s_{kNN}$ in both).

RESPONSE TO: REFEREE 2
"The authors present as simple and extremely timely contribution to help the community develop quantitative metrics and approaches to discover truly new materials, rather than materials that are closely related to existing ones. Importantly, the current work defines some metrics to quantify how novel a material is, and reduces to practice a workflow using a set of readily available tools. While I'm certain we could discuss whether particular methods are the best-in-class, the point is that this is one example of how a workflow might be put together, as the reduction to practice is difficult in and of itself, and others will no doubt build upon this work.
As mentioned earlier, the authors present a conceptual framework of defining novelty and methods that can incorporate this into any ML-driven searches in the future, and will be valuable to the research community."

We certainly hope that this will be improved on in the future and agree that this is one example of a conceptual framework that could take on many different forms.

RESPONSE TO: REFEREE 3
"I was intrigued by the topic of this paper, and the idea of providing a metric for material uniqueness and novelty. The paper features a well-written motivation in the introduction and an abundance of references to recent relevant work, that was highly appreciated."

"Unfortunately, I find that the paper is somewhat hard to follow. It is very technical written, and applies several recently developed methodologies, that I expect only expert readers in ML-driven materials science will be familiar with. The paper would benefit from a more thorough description of used techniques, as well as a better explanation of the chemical/physical background."
The abstract has been updated to include more about the methods and each method’s role in DiSCoVeR. Figure 1’s caption was updated to reflect the interplay of various methods and a table was added which summarizes what each method does and how it fits into the DiSCoVeR workflow.

"In general, a stronger connection to material properties would improve the paper. For example, the authors could list some of the materials in the different clusters in the DensMAP."
The DensMAP cluster figure has been converted to an interactive format.

"In the presented work, a clustering model for bulk moduli is constructed based on the chemical formula of the material with training data obtained from Materials Project. A few questions arise in that regard:"

- "Do the clusters obtained from DensMAP make intuitive sense? I.e. do the correspond to somewhat established material classes?"
In general, a cluster exhibits similarity with respect to both chemical templates and elements used. For example, formulas with the chemical template A2B3 are closer together than those with chemical template A2BCD6. Additionally, compounds with the same template and few mismatched (or many similar) elements tend to be adjacent (e.g. Cs2KNiF6 vs. Cs2NaAlH6). This is a simplified view but suggests that the derived clusters are intuitive.
- "The bulk modulus is structure dependent – but only the chemical formula is given as input to the model. How much of the variation in bulk modulus within a cluster is due to structure variations?"
This is an interesting topic. We haven’t probed the variation in bulk modulus within a cluster; however, chemo-structural novelty is something we’re interested in pursuing beyond this work, and this is something we will keep in mind.
- "On sourcing training data from Materials Project You write: “The highest bulk modulus is chosen when considering identical formulae.” Wouldn’t it make more sense to choose the most stable material? Would this have implications on the performance of the model?"
Commentary has been added to Section 2.3.
- "Please explain the “train contribution to validation log density” more clearly and why it is a measure of material uniqueness."
We added a more thorough explanation of this to Section 2.2.1.
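
For readers of this response, a hedged sketch of the underlying idea (a Gaussian-kernel stand-in, not the exact mat_discover implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def train_log_density(val_emb, train_emb, r=1.0):
    """Approximate the training set's contribution to the density at
    each validation point with a Gaussian kernel sum in the DensMAP
    embedding. Low values mean a validation composition sits far from
    all training compositions, i.e. it is chemically unique relative
    to the training set."""
    d2 = cdist(val_emb, train_emb, "sqeuclidean")
    dens = np.exp(-d2 / (2 * r ** 2)).sum(axis=1)
    return np.log(dens + 1e-12)

# toy usage with random 2D embeddings
log_dens = train_log_density(np.random.rand(10, 2), np.random.rand(100, 2), r=0.1)
```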




Round 2

Revised manuscript submitted on 21 Dec 2021
 

03-Feb-2022

Dear Mr Baird:

Manuscript ID: DD-ART-10-2021-000028.R1
TITLE: DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

Thank you for publishing with Digital Discovery, a journal published by the Royal Society of Chemistry – connecting the world of science to advance chemical knowledge for a better future.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

First, I'd like to thank the authors for addressing my concerns. The revised manuscript has a much improved description of the scope and the methodologies (besides the text, I find Table 1 and Table 7 particularly helpful). Chemical uniqueness, as the authors argued and I agree, is somewhat subjective, and distance from the training set is a reasonable proxy. Given that the manuscript is more about the tool than about applying the tool for materials discovery, I believe the examples presented here are fine for this purpose. But I am very interested to see what the tool can lead to in future work.

I'd recommend the manuscript be published in Digital Discovery.

Reviewer 4

Note: I am contributing a 'data review'.

The authors provide a compelling and well-presented case study in how currently existing tools can be combined to accelerate the process of materials discovery. Because there are many ‘moving parts’ in the article, I commend the authors for making it easy to follow and clarifying the role of individual subroutines in the workflow; Table 1 makes it easy to understand at a glance how each part of the workflow fits together.

The work is exemplary in providing well-commented and organized code, accessible and easy-to-read documentations, and means for reproducibility. These will all help to maximize the utility to the materials community for users who are interested in any particular functionality in the code.

To comment on the Data Availability section in particular, the Colaboratory notebooks are helpful and well-organized. I was able to re-configure them to use a different data source successfully. The repository itself also has a well-stocked example section, including well organized and thought-out use cases that will make it easy for new users of the code to get started.

Data checklist comments (items which I marked 'no' on):

2b. The data cleaning steps are very clear (e.g. filtering noble gases and thermodynamically unstable materials), as is the clear and cogent criterion for choosing between materials which have the same composition but different bulk moduli (e.g. different allotropes of carbon). This is a careful practical consideration and I think the authors made a reasonable choice on how to navigate this concern. The authors may wish to share how many data points these formula 'collisions' affected and which led to omission from the training/validation data set.

3b. Featurization that can be compared against standards primarily comes into play in the manuscript when compositions are used as input to the regression model. While no individual featurization method is 'standard', some are popular in the field (e.g. matminer's statistics-of-composition-based features, Magpie, etc.), and no comparison against these is provided. In this case, the features used by CrabNet are described and benchmarked against baselines elsewhere (Ref 40, "HotCrab" for one-hot featurization). I do not think any omission here presents a weakness in the paper, as the performance of CrabNet is not really the point of the main contribution, and the benchmarking of the model against standard features has already occurred in another manuscript.

4c. Comparisons to a 'current state-of-the-art' or non-trivial alternate models are not made. Comparisons are only made against random search (which, in my view, falls under the class of 'trivial models' in item 4b of the data reviewer checklist, and which this method indeed outperforms). A "state of the art" for materials discovery is ill-defined, though, given the variety of different target properties and the efforts of this manuscript to focus on 'chemical novelty' as a novel target to guide materials discovery.


Issues/comments on the associated software:

1. The ‘generate_elasticity_data.py’ link, in the version of the manuscript I received, appears broken. Either remove it or ensure it has a permanent link. (I was able to find it within the v1.2.1 release linked in reference [1], however, and was able to use it without issue.)

2. The density scatter plots here: https://mat-discover.readthedocs.io/en/latest/figures.html#adaptive-design-comparison do not currently appear to be loading, nor does the cluster count histogram; however, the rest of the plotly graphs are working.

3. In one of the code examples, it might be nice to include just one further step in the use of new data, i.e. how to load in new datasets, not just a pointer to what they are.

4. The Binder notebook appears to have issues when run, both with imports in cells 6 and 12 and with plotting in cell 8, which claims the Discover class is missing an attribute. Please address this.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.