From the journal Digital Discovery

Peer review history

Machine learning enabling high-throughput and remote operations at large-scale user facilities

Round 1

Manuscript submitted on 24 Feb 2022
 

24-Mar-2022

Dear Dr Olds:

Manuscript ID: DD-ART-02-2022-000014
TITLE: Machine learning enabling high-throughput and remote operations at large-scale user facilities

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports are mixed. Overall the reviews are quite strong, but one reviewer raises some serious critiques which should be addressed in your revision. The other reviewers offer clarifications. In addition, one reviewer provided the following minor typographical/editing errors, which I am providing to you below. Please take these into consideration in your revision.


Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

*****
Other Typographical Errors (raised in communication from reviewers)

p. 5, second column, first full paragraph
(iii) and supervised learning for predicting a functional labels or values associate with a data stream

should read
“a functional label or values”
or
“functional labels or values”

p.6 of 17, Section 2, Second paragraph, left column
“Once access the historical or active data stream”
should be
Once access to the historical or active data stream

It would be nice if it presented some limit on data: some characterization of how long it took to analyze a dataset, so a person could see how well it might scale.

p 11 of 17, left column
“The features were calculated from the times series following two preprocessing steps”
should probably be
“The features were calculated from the time series following two preprocessing steps”

p. 12 of 17, First partial paragraph, last sentence.
“We will here demonstrate an application of supervised learning employed on a beamline to classify data quality in a bianary fasion as simply ‘good’ or ‘bad’ data.”
should read
“...in a binary fashion, as simply 'good' or 'bad' data”

p. 13 of 17, right hand column, last paragraph
“with an tell-report-ask interface” should be
“with a tell-report-ask interface”

p. 14 of 17, left column, second full paragraph
“With such an interface, the model can access the result files from a folder and return a prediction of whether the measurement considered an anomaly. “
should be
“whether the measurement is considered an anomaly”



************


 
Reviewer 1

This is a very well written manuscript describing some excellent research that addresses the very real challenge of how machine learning (ML) methods can be used to enhance the scientific programme of large scale user facilities.

It is recognised that ML methods can be applied to a variety of scientific data to speed up data analysis and to automate systems using a data driven approach. For user facilities there is a real challenge to develop and deploy useful functionality that can be used by facility users, who are not necessarily experts in data science.

The manuscript provides a clear description of how 3 key ML methods can be integrated into a data pipeline and describes the interface developed between elements to allow pipelines to be deployed in a reproducible manner that fits well within facility data systems and services.

The data science and software development is applied within the context of photon science experiments at NSLS-II, including integration with the NSLS-II data acquisition system.
The scientific examples are all well chosen and it is clear that ML methods provide real value added for the applications and methods chosen. The methods and examples cover the broad areas of experimental modalities, diffraction, spectroscopy and imaging.

What is of great value to those working in this area, or facilities wishing to utilise ML, is the inclusion of the source code on GitHub. There are two key points: the repository is actively managed, and the code is professionally engineered.
The project is open source and could be further developed by the authors or others to expand the use cases / methods.

I recommend publication without further revisions.

Reviewer 2

This is an excellent and timely paper. The light sources (and other large-scale experimental facilities) are beginning to experience a data tsunami in which more data are taken than can be easily analyzed using traditional methods. AI/ML methods show great promise in shortening the time between data collection and the production of analyzed results. AI/ML will also increasingly be used to drive experiments and optimize data-taking, but only if researchers can understand how to appropriately apply ML. This paper provided a nice clear description of how to apply ML approaches to some specific x-ray measurement techniques.

I would have liked a little more information about how to prevent overfitting or underfitting and a clearer presentation of the guidelines for dividing a dataset into training/validation sets.

There was a statement at the end that said "each of the models could be trained on a personal computer in a matter of minutes or even seconds". How large was that dataset and what was the nature of the dataset (point values, waveforms, 2D images, etc)? I think the size of the dataset might have been mentioned earlier, but it would be helpful to restate it here to give a sense of how this might scale to larger datasets.

Reviewer 3

The paper by Konstantinova targets machine learning at large-scale facilities, in particular at the NSLS-II synchrotron. The authors highlight the importance of integrating machine learning seamlessly and close to the beamlines. This is an important topic and will become even more important as the data rates of experimental techniques increase sharply. The first figure in the publication presents a flowchart of the overall development of an AI solution for beamlines. Unfortunately, the paper does not present much more detail on the integration of ML solutions into Bluesky, descriptions of how this is used by scientists, etc. The paper focuses on very common definitions and language in ML and only on page 4 starts with unsupervised learning. Konstantinova applies PCA and NMF to PDF spectra of molten NaCl:CrCl3 across a variety of temperatures. The authors present the NMF components and the reconstruction of the original spectra. The authors describe the “failing of the model”, or better the surrogate model; however, as they described earlier, it depends on the number of NMF components. It would have been interesting to see autoencoders; the author mentions the approach at the beginning of Section 3. Since autoencoders with linear activation are equivalent to PCA, this could have been very interesting. Again, it is not clear at all how this was implemented with Bluesky. The anomaly detection is pretty interesting (how is this implemented at the beamlines, what information does the user see during beamtime, is this real-time, etc.). It seems a strange choice that, after the previous section, none of the previous approaches is even tried to detect anomalies. In summary, an interesting paper could be developed on the anomaly approach and, in addition, maybe a more modern approach to the PDF data.

Reviewer 4

The role of machine learning at large-scale user facilities is becoming ever more critical to the ability to process large amounts of valuable data quickly, and to reap the benefits of such facilities for new scientific discoveries. Machine learning is the next “killer app” in this space. In the user facility space, focus is pivoting away from traditional exploration and discovery modes and focusing on the use of AI/ML and other methods to plan and execute experiments, to analyze and interpret data, and to synthesize results to propose new experiments. This is a topic that can’t be ignored.

The paper is well-written and explores this important space in great detail. The authors developed a framework to easily deploy and execute such methods at many instruments, and tested a number of ML methods, showing their significance.


 

Dear Editor,
Thank you and the reviewers for their careful and thoughtful read of our manuscript, “Machine learning enabling high-throughput and remote operations at large-scale user facilities”. We appreciate the reviewers' strong support of the manuscript and have revised it accordingly to meet their suggestions. We here document our responses to the reports. We have also included an annotated version of the manuscript which highlights what changes were made, including all noted typographical errors.

Response to Reviewer 1:

---This is a very well written manuscript describing some excellent research that addresses the very real challenge of how machine learning (ML) methods can be used to enhance the scientific programme of large scale user facilities. It is recognised that ML methods can be applied to a variety of scientific data to speed up data analysis and to automate systems using a data driven approach. For user facilities there is a real challenge to develop and deploy useful functionality that can be used by facility users, who are not necessarily experts in data science.

We strongly agree with the reviewer’s assessment of the challenge facing user facilities in this regard.

---The manuscript provides a clear description of how 3 key ML methods can be integrated into a data pipeline and describes the interface developed between elements to allow pipelines to be deployed in a reproducible manner that fits well within facility data systems and services.

We thank the reviewer for their assessment. Our intention was not so much to showcase the ideal AI/ML methods for different circumstances as to clearly demonstrate how different approaches can be integrated with an analysis pipeline.

---The data science and software development is applied within the context of photon science experiments at NSLS-II including integration with the NSLS-II data acquisition system. The scientific examples are all well chosen and it is clear that ML methods provide real value added for the applications and methods chosen. The methods and examples cover the broad areas of experimental modalities, diffraction, spectroscopy and imaging.

We appreciate the reviewer’s recognition, as we had endeavored to ensure the examples covered broad modalities.

---What is of great value to those working in this area, or facilities wishing to utilise ML, is the inclusion of the source code on GitHub. There are two key points: the repository is actively managed, and the code is professionally engineered. The project is open source and could be further developed by the authors or others to expand the use cases / methods.

We completely agree with the reviewer; it was important to us that these methods be as open and accessible as possible. As such, we have continued to curate the code base of the examples presented in the paper, particularly in reference to suggestions made by reviewer #3. We hope this manuscript can help catalyze the development of a community focused on developing AI/ML methods for large-scale user facilities.

--- I recommend publication without further revisions.

Response to Reviewer 2:

--- This is an excellent and timely paper. The light sources (and other large-scale experimental facilities) are beginning to experience a data tsunami in which more data are taken than can be easily analyzed using traditional methods. AI/ML methods show great promise in shortening the time between data collection and the production of analyzed results. AI/ML will also increasingly be used to drive experiments and optimize data-taking, but only if researchers can understand how to appropriately apply ML. This paper provided a nice clear description of how to apply ML approaches to some specific x-ray measurement techniques.

We thank the reviewer for their support and agree with their assessment of the daunting data situation all such large-scale user facilities are facing.

--- I would have liked a little more information about how to prevent overfitting or underfitting and a clearer presentation of the guidelines for dividing a dataset into training/validation sets.

This is an excellent suggestion. In addition to referring a reader to Ref. 28, we have revised the text in section 2 to address this issue, with text added that states (p. #3):

“The data for the training and validation sets should come from similar distributions to make the comparison of model performance between them meaningful. However, the data-generation process needs to be taken into account when splitting the data. For example, measurements repeated at similar conditions during the same experiment should be attributed to the same set. There is no standard rule for the splitting ratio as long as all sets are representative of the data distribution. A 60:20:20 split for training, validation and test sets, respectively, is common. If a test (holdout) set is not available during model training, an 80:20 split between the sets is typical. However, for larger amounts of data, the validation set can be smaller.”
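For readers who want to apply this splitting guidance in practice, the following is a minimal sketch of our own (not from the manuscript), using scikit-learn's `GroupShuffleSplit` so that repeated measurements from the same experiment stay in the same set while producing an approximate 60:20:20 split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))           # 100 measurements, 8 features each
groups = np.repeat(np.arange(20), 5)    # 20 experiments, 5 repeated scans each

# Step 1: hold out 20% of the experiments (whole groups) as a test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
trainval_idx, test_idx = next(gss.split(X, groups=groups))

# Step 2: split the remainder 75:25 by experiment -> ~60:20:20 overall.
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
rel_train, rel_val = next(gss2.split(X[trainval_idx], groups=groups[trainval_idx]))
train_idx, val_idx = trainval_idx[rel_train], trainval_idx[rel_val]

# No experiment contributes to more than one set.
assert set(groups[train_idx]).isdisjoint(groups[val_idx])
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

Splitting by group rather than by individual scan is what enforces the rule quoted above that repeated measurements from the same experiment land in the same set.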

--- There was a statement at the end that said "each of the models could be trained on a personal computer in a matter of minutes or even seconds". How large was that dataset and what was the nature of the dataset (point values, waveforms, 2D images, etc)? I think the size of the dataset might have been mentioned earlier, but it would be helpful to restate it here to give a sense of how this might scale to larger datasets.

We have revised the text to clearly state the datasets size and the nature of the data at this point in the discussion, as this is an important consideration for any reader considering development of their own approaches. We have also added text to section 2 which describes the challenge and recommended approach to dealing with limited labeled datasets – a common occurrence for beamline users – which states (p. #3):

“The size of the dataset for model development depends on the problem, the intended model and data availability. It is important that the training data capture the diversity and relative frequency of the intended use cases. Many models benefit from having a large amount of training data, which is available at synchrotron user facilities. However, labeled data remain a limited resource. The cost of labeling additional data should be weighed against the expected performance increase in each particular case. Having too large a dataset can also pose a problem for model development. Some algorithms, such as anomaly detection, do not perform well on very large datasets. More generally, the ability of a model to learn new information saturates at a certain amount of data, while the demand for computational resources keeps growing. The datasets used in this work contain fewer than 1000 points.”

Response to Reviewer 3:

--- The paper by Konstantinova targets machine learning at large-scale facilities, in particular at the NSLS-II synchrotron. The authors highlight the importance of integrating machine learning seamlessly and close to the beamlines. This is an important topic and will become even more important as the data rates of experimental techniques increase sharply.
We agree with the reviewer on the importance of this topic, and particularly wanted the contribution to serve the growth of this community, which will likely be driven in part by facility users with limited background or formal training in machine learning methods.

--- The first figure in the publication presents a flowchart of the overall development of an AI solution for beamlines. Unfortunately, the paper does not present much more detail on the integration of ML solutions into BlueSky, descriptions of how this is used by scientists, etc.

While the provided GitHub repository was meant to present the detailed integration of the ML solutions with Bluesky (being fully demonstrated therein), we agree with the reviewer that the details of this integration in the body of the manuscript could be expanded. We have revised the manuscript to lay out the approach both conceptually and with pseudocode (p. #11). Readers are also explicitly advised that all details of the implementation can be found in the provided GitHub repository. The revised text states the following:

p. #5
“As implemented (see Code Availability), the decomposition and clustering algorithms can be readily deployed onto other beamlines or types of measurement that produce relatively low dimensional data (1-d or small 2-d)."

p. #11:
“Here we outline how each of the preceding sections was implemented, to demonstrate the diverse integration modes across the facility. Each of these deployment techniques and training strategies can be found in the accompanying code repository. Firstly, it is useful to have a generic interface to expect with AI models, so that similar models can be deployed in different experimental processes regardless of other design decisions. This separates the design of the agent from the management of data streaming, in-line callbacks, or application-specific techniques. Following recent work in adaptive experiments at NSLS-II and the developments of Bluesky Adaptive70,71"

p. #11:
“A complete tutorial using the tell–ask components to deploy multiple AI models can be found at reference 71, and all the models presented here have been made available (see Code Availability statement)."

p. #11:
“Alternatively, this can be implemented using the following pseudocode with a databroker catalog for the most recent measurement.
<…>
Here, the AnomalyAgent() is designed in such a way that it only ingests the experiment identification number and the dictionary of the time series, without being tied to the way that the dictionary is created. The data can be processed from the catalog or from hdf5 files in a folder."
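To make the shape of such an interface concrete, here is a minimal, hypothetical sketch of a tell/report-style agent in plain Python. The class name echoes the AnomalyAgent() mentioned above, but the z-score logic and all internals are our illustration only; the actual facility implementation is built on Bluesky Adaptive and databroker, as shown in the repository.

```python
import numpy as np

class AnomalyAgent:
    """Hypothetical sketch of a tell/report-style agent.

    It ingests an experiment identifier and a dictionary of time series,
    without being tied to how the dictionary was created (catalog, HDF5, ...).
    """

    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.features = {}  # uid -> feature vector

    def tell(self, uid, series):
        """Ingest one measurement: reduce each named time series to a feature."""
        self.features[uid] = np.array(
            [np.asarray(v).std() for v in series.values()]
        )

    def report(self, uid):
        """Return a prediction of whether the measurement is an anomaly,
        here via a z-score against all other measurements seen so far."""
        others = np.stack([v for k, v in self.features.items() if k != uid])
        mu = others.mean(axis=0)
        sigma = others.std(axis=0) + 1e-9
        z = np.abs((self.features[uid] - mu) / sigma)
        return bool((z > self.threshold).any())

# Example: ten ordinary scans and one with a wildly different spread.
rng = np.random.default_rng(0)
agent = AnomalyAgent()
for i in range(10):
    agent.tell(f"scan-{i}", {"intensity": rng.normal(0, 1, 200)})
agent.tell("scan-bad", {"intensity": rng.normal(0, 100, 200)})
```

The point of the design, as the quoted text says, is that `tell` accepts only an identifier and a dictionary, so the same agent works whether the data arrive from a catalog query or from files on disk.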

---The paper focuses on very common definitions and language in ML and only on page 4 starts with unsupervised learning. Konstantinova applies PCA and NMF to PDF spectra of molten NaCl:CrCl3 across a variety of temperatures. The authors present the NMF components and the reconstruction of the original spectra. The authors describe the “failing of the model”, or better the surrogate model; however, as they described earlier, it depends on the number of NMF components.

We have clarified and expanded the description of “model failing” in the context of this example, such that the text now states:

“As evidenced by the increased relative error, the model is failing to fit the data in the region from 400-500 C. This can be attributed to the existence of spurious anomalous features in the diffraction data during the co-existence of the amorphous and crystalline phases40 that would require far more than 4 components to fit well. However, as the application of NMF here is being used to highlight and cluster regions of interest in the data, and not necessarily extract physically real components, the limited 4-component model is effective.”
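The dependence of reconstruction error on component count can be illustrated on synthetic data (our sketch, not the manuscript's molten-salt dataset) with scikit-learn's NMF, watching the relative error of the surrogate fall as components are added:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 patterns of 300 points, mixed from 5 non-negative components.
true_components = rng.random((5, 300))
weights = rng.random((50, 5))
X = weights @ true_components

rel_error = {}
for k in (2, 4, 5):
    model = NMF(n_components=k, init="nndsvda", max_iter=5000, random_state=0)
    W = model.fit_transform(X)
    recon = W @ model.components_
    # Relative reconstruction error, the quantity used above to flag
    # regions the limited-component model fails to fit.
    rel_error[k] = np.linalg.norm(X - recon) / np.linalg.norm(X)

# Too few components leave structure unexplained; enough components fit well.
assert rel_error[2] > rel_error[5]
```

A per-pattern version of the same error (row-wise norms) is what highlights regions such as the two-phase co-existence window described in the quoted text.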

--- It would have been interesting to see autoencoders; the author mentions the approach at the beginning of Section 3. Since autoencoders with linear activation are equivalent to PCA, this could have been very interesting.

We agree with the reviewer that autoencoder models are useful and important for scientific-data applications such as visualization, dimensionality reduction and noise removal, as shown in references 38 and 72. However, our goal was not to show all possible methods, but rather to show examples of the model development cycle for several use cases, focusing broadly on the how and why of different approaches. We do not feel that including a demonstration of autoencoder models in this manuscript would be of great value to the majority of readers and would prefer to leave it outside the scope of this work.

--- Again, it is not clear at all how this was implemented with BlueSky.

We hope that the new language in the manuscript effectively describes the approach and points readers to the GitHub repository where this implementation is clearly shown.

--- The anomaly detection is pretty interesting (how is this implemented at the beamlines, what information does the user see during beamtime, is this real-time, etc.). It seems a strange choice that, after the previous section, none of the previous approaches is even tried to detect anomalies.

We appreciate the reviewer's interest in the development of anomaly detection for beamline operations. We see great potential for broad application of such methods across a variety of facility applications. We provided examples of three methods for anomaly detection, which we believe is sufficient for the introductory discussion contained within the manuscript. A detailed exploration of the best methods for accomplishing optimal anomaly detection is beyond the scope of this contribution.
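As a generic illustration of this kind of model (our example, not necessarily one of the three specific methods in the manuscript), an off-the-shelf Isolation Forest can be trained on features from known-good scans and then flag off-distribution measurements:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(200, 4))   # features from known-good scans
bad = rng.normal(8.0, 1.0, size=(5, 4))      # clearly off-distribution scans

# Fit only on good data; at prediction time, +1 = inlier, -1 = anomaly.
clf = IsolationForest(random_state=0).fit(good)
labels = clf.predict(bad)
```

Fitting on curated "good" data and predicting on incoming scans mirrors the binary good/bad classification setting discussed in the manuscript.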

--- In summary, an interesting paper could be developed on the anomaly approach and in addition maybe a more modern approach to the PDF data.

We thank the reviewer for their thoughtful suggestions to improve the manuscript. We look forward to pursuing the areas of interest to them (and ourselves) in future work.

Response to Reviewer 4:

--- The role of machine learning at large-scale user facilities is becoming ever more critical to the ability to process large amounts of valuable data quickly, and to reap the benefits of such facilities for new scientific discoveries. Machine learning is the next “killer app” in this space. In the user facility space, focus is pivoting away from traditional exploration and discovery modes and focusing on the use of AI/ML and other methods to plan and execute experiments, to analyze and interpret data, and to synthesize results to propose new experiments. This is a topic that can’t be ignored.

As scientists on the front lines of this rapidly evolving domain, we strongly agree with the reviewer's assessment.

--- The paper is well-written and explores this important space in great detail. The authors developed a framework to easily deploy and execute such methods at many instruments, and tested a number of ML methods, showing their significance.

We thank the reviewer for their strong support of the manuscript.




Round 2

Revised manuscript submitted on 20 Apr 2022
 

18-May-2022

Dear Dr Olds:

Manuscript ID: DD-ART-02-2022-000014.R1
TITLE: Machine learning enabling high-throughput and remote operations at large-scale user facilities

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

The authors have made specific changes to the manuscript in regard to referee #3's initial review, which I suggest fully cover those comments.
As previously, I recommend publication.

Reviewer 4

All suggestions and concerns have been satisfactorily addressed. The manuscript is suitable for publication.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.