From the journal Digital Discovery Peer review history

A human-in-the-loop approach for visual clustering of overlapping materials science data

Round 1

Manuscript submitted on 08 Sep 2023
 

06-Nov-2023

Dear Dr El-Mellouhi:

Manuscript ID: DD-ART-09-2023-000179
TITLE: Overcoming Challenges for Visual Clustering of Overlapping Materials Science Data

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Professor Jason Hein
Associate Editor, Digital Discovery

************


 
Reviewer 1

This report provides an overview of the data and code revision for the manuscript "Overcoming Challenges for Visual Clustering of Overlapping Materials Science Data" by S. Bonakala et al. The manuscript introduces a timely proposal aimed at addressing the challenge of discovering patterns within complex materials databases using an unsupervised learning approach. Despite using only well-established models, the authors’ proposed pipeline shows promise in alleviating uncertainty issues in identifying relevant subgroups within the vast chemical space. However, from a technical perspective, the current version of the manuscript lacks critical details, such as the availability of the dataset to the public and the disclosure of the implemented data analysis code, both of which are necessary to meet the reproducibility standards of the Digital Discovery journal. In addition, there are other minor issues outlined below that the authors may want to consider when revising their manuscript.

Data sources: Availability of the data used

The authors have indicated that the data used in their manuscript will be accessible via their AIMATER platform. However, it is recommended practice to provide at least a sample of the data utilized in the manuscript for validation purposes during the review process.

Code and reproducibility: availability of code and scripts

The scripts or workflows necessary to reproduce the findings presented in the manuscript are not available in a public repository and were not provided as supplementary material. Although certain elements of the authors' proposed pipeline cannot be fully automated due to the need for expert intervention, there are still other steps within the workflow that could be reproduced if the necessary scripts were made available. The absence of such technical information restricts the ability of reviewers and the scientific community to fully evaluate and replicate the authors' methodology.

Additional comments:

1. The authors have employed two widely recognized dimensionality reduction (DR) techniques, t-SNE and UMAP, as a baseline comparison for their hybrid pattern identification approach. It is worth noting that a recently published DR method called PacMAP seems to outperform t-SNE and UMAP in terms of preserving both global and local data structures within low-dimensional embeddings. For further information, the authors might find it interesting to explore the following link to the manuscript: https://arxiv.org/abs/2012.04456.

2. The qualitative decision to merge clusters based on human intervention should be taken with a grain of salt. Due to the curse of dimensionality, data points tend to exhibit greater dispersion in the original data space. Thus, depending on the dimensionality reduction technique employed, the apparent overlap between clusters observed in the low-dimensional space may well be an artifact of the projection method. The manuscript currently lacks this critical perspective, and its inclusion would aid readers in comprehending the limitations of the authors' approach.

3. While the manuscript focuses primarily on a qualitative data analysis through an unsupervised approach, it is important to note that there exists a range of metrics designed to assess the quality of clustering methods in a quantitative manner. These metrics could provide a more robust evaluation of cluster pairs, especially in scenarios where clusters exhibit different shapes, densities, or possess ill-defined boundaries. Expanding the manuscript to include these metrics would contribute to a more comprehensive and quantitative evaluation of the proposed methodology.

4. On page 2, the authors mention that "all these studies resulted in overlapped clusters of the MOF and other inorganic materials datasets that lack delineation of the isolated cluster." It is important to note that achieving non-overlapping clusters is not necessarily a limitation of a method. Fuzzy clustering, as exemplified by the Gaussian Mixture model (GMM) used by the authors, is an active research area, allowing data points to belong to multiple clusters with varying probabilities. In the context of materials properties, this approach may be reasonable, as it acknowledges shared properties across clusters. Hence, it would be valuable for the readers if the authors can provide further clarification on the relevance of enforcing a rigid partition of the chemical space.

Reviewer 2


The submitted manuscript “Overcoming Challenges for Visual Clustering of Overlapping Materials Science Data” is built around the analysis of a small dataset of about 1,000 Cu-MOFs described by five computed properties. The authors illustrate that usual t-SNE, UMAP, and PCA projections combined with clustering methods are not reliable in finding effective clusters of MOFs in the feature space.

One reasonable argument is made that it is hard to quantify the quality of clustering methods, so the authors elegantly propose to analyze the separation between two clusters on a 2D plot (logDA plots) in which they use the “decision axis of the logistic regression as the first visualization axis, and the first principal component in the subspace orthogonal to the logistic axis as the second visualization axis”. This methodology is used to regroup the 10 clusters found with their first method into independent clusters.

The major limitation of this article is that the method is only applied to a relatively small dataset compared to the ~500k MOFs available, and with very few features (5). The authors do not comment on why these 5 features were selected, and the manuscript would be more impactful if they studied the technique in higher dimensions with less "human bias" in the selection of the descriptors, as the goal is to understand more widely counter-intuitive patterns. I would recommend, if possible, trying the workflow on a larger dataset and with a variable number of features.

The title “Overcoming Challenges…” suggests that the manuscript will demonstrate that the new GMM-EDDA clustering technique is better. Thus, applying the logDA techniques to PCA-clustering or other usual clustering methods (kmeans, Ward...) is needed to compare efficiency in determining clusters. This comparison would be preferable to Figures 5 and 6.

Could a quantitative metric be provided for the ‘M’ and ‘S’ decision? I think it is very arbitrary here, especially when considering “density” for plots with about 100 points. As an example, even if it does not affect the final clustering method, I would attribute an ‘M’ to the C6-C7 plot rather than an ‘S’. Same for the black and white C1-8. A quantitative metric would make the workflow very appealing, and I don’t think that an analyst does better than a machine when the axes cannot be physically interpreted.



For the data-review checklist, I have responded ‘No’ to the following points:

1c. Are any potential biases in the source dataset reported and/or mitigated?

The authors report that they focus on Cu-MOFs due to their industrial applications. There is no code reported that shows how the structures were filtered out from the initial database.

3b. Are comparisons against standard feature sets provided?

The authors made the choice to use simulation-derived parameters for the clustering of the MOFs discussed in order to do a top-down clustering approach. My understanding is that standard feature sets are more available for SAR and may not apply here (https://deepchem.io/tutorials/introduction-to-material-science/).

4a. Is a software implementation of the model provided such that it can be trained and tested with new data?

The authors do not provide the implementation of the code to reproduce the figures and the logDA methodology. The clustering done at the end of the paper needs a human-in-the-loop decision to be performed.
I recommend publishing a GitHub repository with the data and the code used to go from the features to the logDA plots.

4b. Are baseline comparisons to simple/trivial models (for example, 1- nearest neighbour, random forest, most frequent class) provided?

I recommend applying the same methodology to usual clustering techniques to compare the outcomes (this has already been developed before). Simple k-means and Ward algorithms could be tested. The manuscript would benefit from having these figures in the SI.

6a. Is the code or workflow available in a public repository?

No mention of any repository in the manuscript.

6b. Are scripts to reproduce the findings in the paper provided?

No mention of any script in the manuscript.


I responded ‘Yes’ to the following question but have some comments:

3a. Are methods for representing data as features or descriptors clearly articulated, ideally with software implementations?

It would be helpful to share the data obtained in a future repository to enable the reproduction of the results. The methods used to obtain them are clearly described in the manuscript. Clear identification of the MOFs would be preferable.


In conclusion, I think that the idea presented is valuable for publication but needs to be more carefully compared to existing techniques and would gain from being examined in different situations. The workflow does not have to be specifically applied to MOFs and could be applied to any virtual datasets.


 

REVIEWS Response
Referee 1
This report provides an overview of the data and code revision for the manuscript "Overcoming Challenges for Visual Clustering of Overlapping Materials Science Data" by S. Bonakala et al. The manuscript introduces a timely proposal aimed at addressing the challenge of discovering patterns within complex materials databases using an unsupervised learning approach. Despite using only well-established models, the authors’ proposed pipeline shows promise in alleviating uncertainty issues in identifying relevant subgroups within the vast chemical space. However, from a technical perspective, the current version of the manuscript lacks critical details, such as the availability of the dataset to the public and the disclosure of the implemented data analysis code, both of which are necessary to meet the reproducibility standards of the Digital Discovery journal. In addition, there are other minor issues outlined below that the authors may want to consider when revising their manuscript.
We thank the reviewer for the encouraging comments. We have included all the programs used to generate and process the data in the Supplementary Information and in our GitHub repository.
Data sources: Availability of the data used

The authors have indicated that the data used in their manuscript will be accessible via their AIMATER platform. However, it is recommended practice to provide at least a sample of the data utilized in the manuscript for validation purposes during the review process.

Code and reproducibility: availability of code and scripts

The scripts or workflows necessary to reproduce the findings presented in the manuscript are not available in a public repository and were not provided as supplementary material. Although certain elements of the authors' proposed pipeline cannot be fully automated due to the need for expert intervention, there are still other steps within the workflow that could be reproduced if the necessary scripts were made available. The absence of such technical information restricts the ability of reviewers and the scientific community to fully evaluate and replicate the authors' methodology.
Following the reviewer’s advice, the data, scripts, and code utilized in the manuscript are available at our GitHub repository:
https://github.com/elfedwa/Visual-Clustering-of-Overlapping-Materials-Science-Data.git

1. The authors have employed two widely recognized dimensionality reduction (DR) techniques, t-SNE and UMAP, as a baseline comparison for their hybrid pattern identification approach. It is worth noting that a recently published DR method called PacMAP seems to outperform t-SNE and UMAP in terms of preserving both global and local data structures within low-dimensional embeddings. For further information, the authors might find it interesting to explore the following link to the manuscript: https://arxiv.org/abs/2012.04456.
We thank the referee for pointing out the PacMAP method and bringing it to our attention. We have added a mention of the method and cited it on page 2, paragraph 2.

“Dimensionality reduction (DR) techniques like t-SNE, UMAP, and Pairwise Controlled Manifold Approximation Projection (PacMAP)~\cite{Wang2021} have shown effective visualization results on various real-world datasets. As already mentioned, the main issue with DR techniques is the loss of information or embedding distortions~\cite{NonatoAupetit2019,CollangeSupNeRV2020} that can result in actual data clusters being represented as overlapping in the embedding, as typical of PCA~\cite{Elhaik2022issuesWithPCA,abbas2023clustml}, while other nonlinear neighbor embedding techniques like t-SNE, UMAP or PacMAP may also be subject to cluster split~\cite{Wattenberg2016tsne}. As it is essential to get a trustworthy visualization of cluster structure to support visual clustering by end-users, we propose an approach based on a set of logistic-based linear projections specifically designed to avoid cluster overlap when embedding each pair of pre-computed clusters.”
2. The qualitative decision to merge clusters based on human intervention should be taken with a grain of salt. Due to the curse of dimensionality, data points tend to exhibit greater dispersion in the original data space. Thus, depending on the dimensionality reduction technique employed, the apparent overlap between clusters observed in the low-dimensional space may well be an artifact of the projection method. The manuscript currently lacks this critical perspective, and its inclusion would aid readers in comprehending the limitations of the authors' approach.
This is indeed true. That said, we use logistic regression precisely to minimize the chance of overlap between the clusters found by the clustering technique. Logistic regression ensures that if two clusters are linearly separable in the high-dimensional data space, they will also be separable in the logistic projection space (the reverse is not true, though). Moreover, because the initial clusters come from a Gaussian mixture model, they are likely to be Gaussian distributed and hence naturally convex (ellipsoids) in the high-dimensional space.
However, the clusters captured by the GMM may not actually be Gaussian distributed. In such cases, the end-user may observe the two clusters overlapping even though they are actually well (non-linearly) separated.
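The logistic projection argument above can be made concrete with a small sketch. It is not the authors' implementation (their code is in the repository cited in this response). It assumes NumPy, a plain gradient-descent logistic fit without regularisation, synthetic 5-feature data, and a hypothetical helper name `logda_projection`. It follows the construction quoted by Referee 2: the logistic decision axis as the first visualization axis and the first principal component of the orthogonal subspace as the second.

```python
import numpy as np

def logda_projection(X, y, n_iter=2000, lr=0.1):
    """Project two labelled clusters onto (logistic axis, orthogonal first PC)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)          # labels must be 0 or 1
    Xc = X - X.mean(axis=0)                 # centre the features

    # Plain gradient descent on the logistic loss (illustrative, unregularised).
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xc @ w + b)))
        w -= lr * Xc.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)

    u = w / np.linalg.norm(w)               # axis 1: logistic decision direction
    X_perp = Xc - np.outer(Xc @ u, u)       # remove the component along u
    # Axis 2: first principal component of the remainder (orthogonal to u).
    _, _, vt = np.linalg.svd(X_perp, full_matrices=False)
    v = vt[0]
    return np.column_stack([Xc @ u, Xc @ v]), u, v

# Synthetic stand-in for one cluster pair: two Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 5)), rng.normal(size=(100, 5)) + 4.0])
y = np.array([0] * 100 + [1] * 100)
Z, u, v = logda_projection(X, y)
```

Plotting the two columns of `Z`, coloured by `y`, would give one pairwise view of the kind an analyst inspects before deciding to merge or separate.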

We thank the referee for drawing our attention to this important point. We added the following reflection to the manuscript in Section 5, page 5, to provide an enhanced critical perspective:

“This work employs logistic regression to minimize the probability of overlap between clusters. It is worth mentioning some limitations related to the fact that we consider only linear separation between clusters, so we may merge two clusters if they are not linearly separable enough. However, because we are considering clusters from a Gaussian mixture model, the clusters are Gaussian and hence naturally convex (ellipsoids), and they become non-convex only when two clusters overlap (for instance, one dense Gaussian distribution inside the area of a larger, less dense Gaussian distribution).”
3. While the manuscript focuses primarily on a qualitative data analysis through an unsupervised approach, it is important to note that there exists a range of metrics designed to assess the quality of clustering methods in a quantitative manner. These metrics could provide a more robust evaluation of cluster pairs, especially in scenarios where clusters exhibit different shapes, densities, or possess ill-defined boundaries. Expanding the manuscript to include these metrics would contribute to a more comprehensive and quantitative evaluation of the proposed methodology.
We thank the referee for the relevant point.

Our proposal emphasizes the importance of human-in-the-loop decisions in clustering. Clustering techniques and clustering quality metrics are both a form of human knowledge embedded in a computational function and are subject to the same issue mentioned by the reviewer, except that they are predefined and generic, while our approach lets the end-user decide for each pair of clusters based on a wide range of possible visual patterns. We argue that letting the end-user decide by visual analysis of carefully chosen two-dimensional linear projections of the data can improve a base clustering technique, help detect nonlinear clusters, and engage the end-user in the decision. The individual decision can still be discussed between several experts visualizing the same data, to reach a more objective consensus decision if needed. To account for this emphasis, we changed the title to “A Human-in-the-loop Approach for Visual Clustering of Overlapping Materials Science Data”.

Regarding quantitative evaluation, it is important to note that clustering quality metrics are designed to compare alternative clusterings of the same set of points, and cannot output a value if there are fewer than two clusters. As a result, it is not possible to use such a metric to decide whether two clusters should be merged into one (no score is available). Nor is it possible to give a meaningful score comparable across all cluster pairs, because their point sets are different. We added a discussion of these scores in Section 4.
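The point about missing scores can be illustrated with the silhouette width, used here only as one example of an internal clustering metric (the response does not name a specific one). The sketch below is a minimal pure-Python version written for illustration; the function name and the toy points are assumptions. The separation term needs at least one other cluster, so the merged alternative has no value to compare against.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width; undefined (ValueError) for fewer than two clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q)
                 for j, (q, lj) in enumerate(zip(points, labels))
                 if lj == c and j != i]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[li]                                     # cohesion
        b = min(v for c, v in mean_dist.items() if c != li)   # separation
        # A singleton cluster gets score 0 by convention.
        scores.append(0.0 if a is None else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two tight pairs of points far from each other.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
score_two = silhouette(points, [0, 0, 1, 1])   # defined: two clusters

try:
    silhouette(points, [0, 0, 0, 0])           # "merged" labelling
    merged_has_score = True
except ValueError:
    merged_has_score = False
```

The separated labelling receives a score, while the merged labelling raises an error, so the two options cannot be ranked by this metric alone.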

4. On page 2, the authors mention that "all these studies resulted in overlapped clusters of the MOF and other inorganic materials datasets that lack delineation of the isolated cluster." It is important to note that achieving non-overlapping clusters is not necessarily a limitation of a method. Fuzzy clustering, as exemplified by the Gaussian Mixture model (GMM) used by the authors, is an active research area, allowing data points to belong to multiple clusters with varying probabilities. In the context of materials properties, this approach may be reasonable, as it acknowledges shared properties across clusters. Hence, it would be valuable for the readers if the authors can provide further clarification on the relevance of enforcing a rigid partition of the chemical space.
We agree with the reviewer. However, the GMM is essentially a model of the density of the data; it may use artificially many components to cover the same area of the data space just because the distribution there is not perfectly Gaussian. By merging clusters using a visual check, we add information about the separability or nestedness of the clusters. The network summary also indicates the complex cluster structure and is decided by human eyes, allowing for a consensus decision among experts rather than blind trust in a fully automatic technique. The merging decisions can be objectified by showing the logDA plots that support them, so other analysts can (in)validate the decision process. Moreover, our approach keeps all the GMM information about shared probabilistic properties among clusters intact. The human-in-the-loop merging decision is complementary to the probabilistic membership and does not replace it: the probabilistic/fuzzy assignment is still available to inform further downstream analysis.

We added a discussion of these aspects in Section 4.
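One way to picture how merge decisions can coexist with the fuzzy GMM assignment is sketched below. It rests on two assumptions that are not stated in this response: that merge decisions are applied transitively (clusters linked by a chain of 'M' decisions form one group), and that the membership of a merged group is the sum of the posteriors of its components. The function names and the numbers are hypothetical.

```python
def merge_groups(n_clusters, merge_pairs):
    """Union-find: clusters linked by an 'M' decision end up in one group."""
    parent = list(range(n_clusters))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in merge_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for c in range(n_clusters):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())

def merged_membership(posteriors, groups):
    """Sum component posteriors within each merged group (rows still sum to 1)."""
    return [[sum(row[c] for c in g) for g in groups] for row in posteriors]

# Hypothetical example: 4 GMM components, analyst marks (0,1) and (2,3) as 'M'.
groups = merge_groups(4, [(0, 1), (2, 3)])
posteriors = [[0.6, 0.3, 0.1, 0.0],
              [0.05, 0.05, 0.5, 0.4]]
merged = merged_membership(posteriors, groups)
```

The original per-component posteriors are left untouched, so both the merged view and the fuzzy view remain available for later analysis.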
Referee 2
1 The submitted manuscript “Overcoming Challenges for Visual Clustering of Overlapping Materials Science Data” is built around the analysis of a small dataset of about 1,000 Cu-MOFs described by five computed properties. The authors illustrate that usual t-SNE, UMAP, and PCA projections combined with clustering methods are not reliable in finding effective clusters of MOFs in the feature space.

One reasonable argument is made that it is hard to quantify the quality of clustering methods, so the authors elegantly propose to analyze the separation between two clusters on a 2D plot (logDA plots) in which they use the “decision axis of the logistic regression as the first visualization axis, and the first principal component in the subspace orthogonal to the logistic axis as the second visualization axis”. This methodology is used to regroup the 10 clusters found with their first method into independent clusters.
We thank the reviewer for the encouraging comments.
2 The major limitation of this article is that the method is only applied on a relatively small dataset compared to the (~500k MOFs available) and with very few features (5). The authors do not comment on why these 5 features were selected and the manuscript would be more impactful if they studied the technique in higher dimensions with less "human bias" in the selection of the descriptors as the goal is to understand more widely counter-intuitive patterns. I would recommend, if possible, trying the workflow on a larger dataset and with a variable number of features.
We thank the reviewer for raising this important point. Generally speaking, adding more features will likely lead to more clusters (if the new features separate the data) or to the same number of clusters (if the new features are noise). The fact that we use expert knowledge (a chemist in this case) does not alter the benefit of the method. We show that even with this feature pre-selection, focusing only on the features that are likely to lead to separation according to expert intuition, standard approaches fail to reveal clusters, while our approach does reveal them.
To demonstrate the above, we applied the method to a subset of 20,000 candidates from the QMOF database and considered 12 features, including chemical, topological and techno-economic features, designed for publication in a subsequent paper (please refer to the document 20k-12features.pdf, provided for referee and editor review only). We demonstrate that our method applies without technical limitation to a 20 times larger sample size, although at an increased computational cost. Turning to the features, in this particular case increasing the number of features from 5 to 12 led to finding 4 clusters. GMMs have practical limitations in the number of dimensions they can handle, because the number of parameters to estimate grows quadratically with the dimension (the covariance of each component) and is proportional to the number of Gaussian clusters, and the number of data points available to estimate these parameters should be at least of the same order of magnitude. In practice, GMMs can operate in about 10-20 dimensions, and Principal Component Analysis can be used as a pre-processing step to reduce the dimension to 10-20 if the data have more dimensions but too small a sample size. The number of cluster pairs to visualize is quadratic in the number of clusters found by the GMM (e.g. 45 for 10 clusters). Hence, the proposed method of including the human in the decision to merge or separate is commensurate with the GMM limitation, making it well suited to human-led intervention to achieve the needed scientific understanding from the separated clusters. We added a discussion of this aspect in Section 4.
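The two scaling statements above are simple counting arguments, sketched here for illustration. The parameter count assumes an unconstrained full-covariance GMM; the constrained covariance models available in GMM-EDDA would need fewer parameters, so these figures are an upper bound rather than the authors' numbers. The function names are hypothetical.

```python
from math import comb

def gmm_free_parameters(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture model."""
    weights = n_components - 1                      # mixing weights sum to 1
    means = n_components * n_features               # one mean vector per component
    # One symmetric covariance matrix per component: d(d+1)/2 entries each.
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances

def n_cluster_pairs(n_clusters):
    """Number of pairwise plots an analyst would have to inspect."""
    return comb(n_clusters, 2)

params_5 = gmm_free_parameters(10, 5)     # 209
params_12 = gmm_free_parameters(10, 12)   # 909
params_20 = gmm_free_parameters(10, 20)   # 2309
pairs_10 = n_cluster_pairs(10)            # 45, as quoted in the response
```

With 10 components, going from 5 to 20 features raises the parameter count from 209 to 2309, which is why the sample size must grow with the dimension, while the number of plots to inspect depends only on the number of clusters.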
3 The title “Overcoming Challenges…” suggests that the manuscript will demonstrate that the new GMM-EDDA clustering technique is better. Thus, applying the logDA techniques to PCA-clustering or other usual clustering methods (kmeans, Ward...) is needed to compare efficiency in determining clusters. This comparison would be preferable to Figures 5 and 6.
We thank the referee and have subsequently rephrased the title to “A Human-in-the-loop Approach for Visual Clustering of Overlapping Materials Science Data” to emphasize our focus on integrating end-user decisions within the clustering pipeline and to discuss qualitative aspects of this alternative.

We also added results from running k-means instead of GMM on the same data (see Supplementary Material).
4 Could a quantitative metric be provided for the ‘M’ and ‘S’ decision? I think it is very arbitrary here, especially when considering “density” for plots with about 100 points. As an example, even if it does not affect the final clustering method, I would attribute an ‘M’ to the C6-C7 plot rather than an ‘S’. Same for the black and white C1-8. A quantitative metric would make the workflow very appealing, and I don’t think that an analyst does better than a machine when the axes cannot be physically interpreted.


We are glad that the referee came to a different opinion about the merge and separation decisions. The fact that we can now debate the decision to merge clusters that seemed impossible to separate using other methods underlines the relevance of the current effort.
It also demonstrates the diversity of decisions that arises when quick and low-cost human visual inspection is involved, giving domain experts both the freedom and the opportunity to debate, arrive at an agreement, and then validate.

Using an automatic score that outperforms human decisions would indeed be great. However, first, as demonstrated above, a pipeline that ensures the reproducibility of results is still under development by the community. Second, a well-optimized and reproducible score would require a high GPU computational cost that is not affordable or accessible to all researchers. Third, previous work has demonstrated that human perception departs substantially from automatic methods in clustering [REF VIS2019], and it is hard to design an automatic criterion for cluster separability [REF Eurovis]. Fourth, existing criteria are not adapted to evaluating a single cluster versus two clusters, nor to comparing clusterness across different sets of points. We discuss these aspects in Section 4.


Here we show that a human in the loop can still intervene quickly and effectively to cluster high-dimensional data, provided that the number of features does not exceed 15-20. There is no “better than machine”; any clustering still requires some form of user validation.

As a perspective and an open question to the community, it would be very beneficial to join forces to create or adopt a metric that automates the merging process, with humans involved at the end of the pipeline to visually check and possibly amend the automated decision. We propose this option as future work.

5 For the data-review checklist, I have responded ‘No’ to the following points:

Are any potential biases in the source dataset reported and/or mitigated?

The authors report that they focus on Cu-MOFs due to their industrial applications. There is no code reported that shows how the structures were filtered out from the initial database.

The CoreMOF database is provided in CSV format, and the down-selection of the subset of Cu-MOFs was done simply by filtering on the metal site attribute. The data subset is now available in our GitHub repository.
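A filtering step of this kind could look like the sketch below. It is not the authors' script: the column name `metal`, the sample rows, and the helper name are all hypothetical, since the actual CSV layout is not given in this response. A real file may also list several metals per structure, which would require a different match rule.

```python
import csv
import io

def filter_by_metal(rows, metal="Cu", column="metal"):
    """Keep rows whose (hypothetical) metal-site column equals the given metal."""
    return [row for row in rows if row[column].strip() == metal]

# Made-up sample standing in for the database CSV; values are illustrative only.
sample_csv = """name,metal,surface_area
MOF-A,Cu,1200
MOF-B,Zn,950
MOF-C,Cu,1830
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
cu_rows = filter_by_metal(rows)
cu_names = [row["name"] for row in cu_rows]
```

Publishing even a short script like this alongside the data would document the down-selection step that the checklist item asks about.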
6 Are comparisons against standard feature sets provided?

The authors made the choice to use simulation-derived parameters for the clustering of the MOFs discussed in order to do a top-down clustering approach. My understanding is that standard feature sets are more available for SAR and may not apply here (https://deepchem.io/tutorials/introduction-to-material-science/).

We thank the referee for raising this point about featurization in materials science and for directing us to renowned automated tools for this task. Indeed, our starting point was a subset of the coreMOF database that already uses the most relevant features for metal-organic frameworks agreed upon in the community. We complemented it with additional features for DFT-calculated properties relevant to the application we are targeting, namely the direct capture of CO2 from air. Hence, the present clustering work resulted from a real use case in which our team faced challenges in clustering Cu-MOF candidates and called on computer science and visualization expert colleagues to work out a solution to this problem.
7 Is a software implementation of the model provided such that it can be trained and tested with new data?

The authors do not provide the code needed to reproduce the figures and the logDA methodology. The clustering done at the end of the paper requires a human-in-the-loop decision.
I recommend publishing a GitHub repository with the data and the code used to go from the features to the logDA plots.



Following the reviewer’s advice, the data, scripts and code used in the manuscript are available in our GitHub repository:
https://github.com/elfedwa/Visual-Clustering-of-Overlapping-Materials-Science-Data.git

8 Are baseline comparisons to simple/trivial models (for example, 1-nearest neighbour, random forest, most frequent class) provided?

I recommend applying the same methodology with standard clustering techniques to compare the outcomes (this has already been developed before). Simple k-means and Ward algorithms could be tested. The manuscript would benefit from having these figures in the SI.

We added the k-means case to the supplementary material. Our aim is not to determine which clustering is best, but to propose an alternative clustering pipeline that engages the end user in the clustering result and data exploration, and that can support discussion between experts to reach a consensus.
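For readers unfamiliar with the baseline mentioned here, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the implementation used for the supplementary material: the function `kmeans`, its parameters, and the two-dimensional toy points are assumptions made for illustration only.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy data (invented): two well-separated groups of three points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Unlike the human-in-the-loop pipeline discussed in this response, a baseline of this kind requires the number of clusters to be fixed in advance and assigns every point to exactly one cluster, which is the main difficulty when the data overlap.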


9 Is the code or workflow available in a public repository?

No mention of any repository in the manuscript.


Corrected and provided.
10 Are scripts to reproduce the findings in the paper provided?

No mention of any script in the manuscript.


Corrected and provided.

I responded ‘Yes’ to the following question but have some comments:

11 Are methods for representing data as features or descriptors clearly articulated, ideally with software implementations?

It would be helpful to share the data obtained in a future repository to enable the reproduction of the results. The methods used to obtain them are clearly described in the manuscript. Clear identification of the MOFs would be preferable.

Corrected and provided.

12 In conclusion, I think that the idea presented is valuable for publication but needs to be more carefully compared to existing techniques and would gain from being examined in different situations. The workflow does not have to be specifically applied to MOFs and could be applied to any virtual datasets.

Indeed, the workflow can be applied to datasets outside materials science. We added the following to the conclusion:
“The present method could be applied to any virtual dataset.”




Round 2

Revised manuscript submitted on 09 Jan 2024
 

29-Jan-2024

Dear Dr El-Mellouhi:

Manuscript ID: DD-ART-09-2023-000179.R1
TITLE: A Human-in-the-loop Approach for Visual Clustering of Overlapping Materials Science Data

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Professor Jason Hein
Associate Editor, Digital Discovery



Please contact the journal at digitaldiscovery@rsc.org





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.