From the journal Digital Discovery
Peer review history

Machine learning platform for determining experimental lipid phase behaviour from small angle X-ray scattering patterns by pre-training on synthetic data

Round 1

Manuscript submitted on 22 Oct 2021
 

16-Jan-2022

Dear Dr Gould:

Manuscript ID: DD-ART-10-2021-000025
TITLE: Machine Learning Platform for Determining Experimental Lipid Phase Behaviour from Small Angle X-ray Scattering Patterns by Pre-training on Synthetic Data

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Professor Jason Hein
Associate Editor, Digital Discovery

************


 
Reviewer 1

This well written paper concerns the development of a machine learning tool for the automatic identification of small angle scattering patterns measured from lipid phases.
The premise of the work is sound - as the authors point out, high throughput methodologies commonly employed at beamlines mean real time analysis of data is challenging, if not impossible, making "on the fly" adjustments to sample conditions unworkable.
Issues associated with the coexistence of phases can also greatly hinder unambiguous phase assignment. Therefore any tool which helps with this is welcomed. There are currently freely available tools for the automatic assignment of peaks in lipid SAXS data (see https://github.com//csbrasnett/lipidsaxs and its application here https://doi.org/10.1016/j.jcis.2020.04.015)
but the advantage of the work shown in this paper is the speed and reliability of the ML.

I have very few questions or suggestions to improve this work. My minor comments are below; most of them are simply for my own interest, but the authors may want to touch on them as possible future developments.
The authors state that they modified their model to include non-peak specific features such as those created by polycarbonate scattering from capillaries, which is a great inclusion, as is peak broadening. How does the model cope if there is the presence for example of a broad peak which might be brought about by the presence of an L3 phase (not uncommon in regions where the flattening of the cubic surface is occurring in a transition from cubic to lamellar)? I appreciate that at this stage in development, not every single eventuality can be accounted for, and that the tool can be iterated, however a major challenge in identifying L3 phases is that they are simply a broad peak indicating a bilayer correlation length and can be overlooked by manual SAXS analysis.

The expectation is that lipid phases are unoriented with respect to the beam, and as such present a full ring pattern for analysis. However in certain situations, samples are aligned with one axis angled specifically to the beam (for example through the application of shear, or via diffusion methods, see Oka's work for some elegant examples). Depending on the specific orientation, certain reflections can be missing, and the full ring pattern is replaced by diffraction spots. Can these be accounted for in the analysis here?

Finally, the paper appears to be concerned with bulk phases (unsurprisingly as they are probably the most commonly measured). However given the increased interest in lipid nanoparticles for drug delivery, many of these will fall into a Q range where the nanoparticle size and shape itself will contribute to the scattering - has this been accounted for (or will it not matter?).

Overall I am satisfied that this paper is an excellent addition to the literature and has provided a method which will be of great benefit to a growing community. I recommend it for publication.

Reviewer 2

I do not quite agree that "assigning the lipid phase observed from a SAXS pattern has been the limiting factor in many lipid researchers workflows", but I do see the true bottleneck instead in calculating electron density maps from SAXS data. However, since indexing is definitely the first step towards an AI-supported determination of electron density maps, your work might become an important puzzle piece on its way. Apart from not entirely sharing your vision, I have only a few comments for improvement of your very well written paper.

For appropriate understanding of data presented in figures 2 and 3 (as well as figures S2 and S3), please briefly describe and define properly PCA, t-SNE and UMAP.

Your reference list omits quite a lot of pioneering work, when it comes to the analysis of the lamellar, inverse hexagonal or bicontinuous cubic phases. Would be nice to see some of them cited in the reference list.

Minor: Write "Holyst" throughout; most of the fonts of the figures in the SI need to be enlarged.

Reviewer 3

My review focuses solely on the data and/or code aspects of the manuscript.

The authors provide the code for synthetic data generation and model training before fine tuning on experimental data in a publicly accessible GitHub repository. The code is clear and easy to follow, and the authors provide a helpful README to guide users through implementing it. However, the rest of their results and analyses cannot be reproduced because the data and code are not available.

I found the following elements to be missing:
- No experimental SAXS data is provided. This is important since the authors pre-train their model on synthetic data and then fine-tune it on experimental data.
- No code is provided for the “data exploration” sections (e.g. data analysis with t-SNE, UMAP and PCA).

Minor discrepancies:
- Package versions differ between the manuscript and GitHub repository.
- In the SI, the authors state that Generate_Hexagonal.py should create 10000 samples. In my test, it only generated 2000.

Minor bugs:
- The requirements.txt file contains a reference to the ‘skimage’ package, which cannot be installed. I believe skimage and scikit-image (listed earlier in the file) are the same, so the skimage requirement can be removed.
- In the Generate_[...].py and Process_[…].py scripts, the np.save() command doesn’t work because the destination directories specified (‘Synthetic_raw’ and ‘Synthetic_Processed’) do not exist in the repo in its current state.


Notes from reviewer checklist:
1a. Experimental SAXS data is not provided.
3b. The model is trained on real and/or synthetic data, so there are no standard features against which to compare.
6b. Although scripts to reproduce synthetic data generation are publicly accessible in the GitHub repo, scripts for further fine-tuning and data exploration are missing, as well as the experimental SAXS data on which the model is fine-tuned.


 

Please see below; we are also including a Word file in which our changes are highlighted in red.

Response to Reviewers
Manuscript ID: DD-ART-10-2021-000025
TITLE: Machine Learning Platform for Determining Experimental Lipid Phase Behaviour from Small Angle X-Ray Scattering Patterns by Pre-Training on Synthetic Data

Response to all reviewers:
The authors would like to thank the reviewers for their comments and time spent reviewing our manuscript. The feedback has been very well received among the authors and we have provided a point-by-point response to all comments and questions. Our feedback is included in red, and any additions to the manuscript have been highlighted in this letter as well as in the attached, edited manuscript.

Referee: 1

Main comments:

This well written paper concerns the development of a machine learning tool for the automatic identification of small angle scattering patterns measured from lipid phases.
The premise of the work is sound - as the authors point out, high throughput methodologies commonly employed at beamlines mean real time analysis of data is challenging, if not impossible, making "on the fly" adjustments to sample conditions unworkable.
Issues associated with the coexistence of phases can also greatly hinder unambiguous phase assignment. Therefore any tool which helps with this is welcomed. There are currently freely available tools for the automatic assignment of peaks in lipid SAXS data (see https://github.com//csbrasnett/lipidsaxs and its application here https://doi.org/10.1016/j.jcis.2020.04.015)
but the advantage of the work shown in this paper is the speed and reliability of the ML.
Overall I am satisfied that this paper is an excellent addition to the literature and has provided a method which will be of great benefit to a growing community. I recommend it for publication.
We are pleased that the reviewer found the work well written, considers it a valuable contribution to the literature and recommends it for publication. In light of the reviewer’s main comments, we have provided additional discussion of other software tools in the literature.
We have introduced the following into the manuscript:
“With human analysis, this can introduce classification bias, which can be significant in complex SAXS patterns, such as those that demonstrate phase coexistence in which individual patterns may exist together simultaneously. Several software tools such as AXcess31, DPDAK,32 or DAWN33 have been developed to assist in the processing of large amounts of data, including 2D diffraction images. In addition, basic analysis such as peak finding and fitting may also be performed with these programmes. However, these tools are not optimized to identify lipid phases and/or to deal with samples exhibiting multiple coexisting phases. This need has been partially addressed in software suites such as Scatter34 or SCryPTA,35 which help the user to identify the lipid mesophase by displaying the peak positions for a given phase and first peak, yet this requires manual validation preventing high throughput analysis of the SAXS data. Other tools, such as those described by Joseph et al.36 or Dully et al.37 compare the ratios between peak positions with respect to those of known lipid polymorphs, allowing to extract the lipid mesophase without the need of user input. However, such approaches have limited success when dealing with coexisting lipid phases. To the best of our knowledge, there exists no method to quantitatively assign the degree of coexistence within a sample.”

Minor comments:

The authors state that they modified their model to include non-peak specific features such as those created by polycarbonate scattering from capillaries, which is a great inclusion, as is peak broadening. How does the model cope if there is the presence for example of a broad peak which might be brought about by the presence of an L3 phase (not uncommon in regions where the flattening of the cubic surface is occurring in a transition from cubic to lamellar)? I appreciate that at this stage in development, not every single eventuality can be accounted for, and that the tool can be iterated, however a major challenge in identifying L3 phases is that they are simply a broad peak indicating a bilayer correlation length and can be overlooked by manual SAXS analysis.

The reviewer makes an excellent point with regard to the L3 phase. The peak from the L3 phase usually appears at q values lower than that of the polycarbonate holder (~0.4 q). In the native model, which is fine-tuned on lamellar and hexagonal patterns, such a feature will likely affect the probability of phase assignment. Our approach is highly flexible, so we anticipate that the model could be adapted to the L3 phase either by fine-tuning on a user’s experimental L3 data or by changing the synthetic generation parameters and deploying a pipeline similar to the one presented in the paper and included in our GitHub repository.
The expectation is that lipid phases are unoriented with respect to the beam, and as such present a full ring pattern for analysis. However, in certain situations, samples are aligned with one axis angled specifically to the beam (for example through the application of shear, or via diffusion methods, see Oka's work for some elegant examples). Depending on the specific orientation, certain reflections can be missing, and the full ring pattern is replaced by diffraction spots. Can these be accounted for in the analysis here?
The reviewer’s insight is welcomed. Although this has not been directly tested, we expect the model to be able to account for missing diffraction peaks as long as this does not introduce a pattern which can be assigned simultaneously to two phases (e.g. for an HII phase, missing the √3 reflection, when only 3 diffraction peaks are present, may lead to confusion with lamellar phase while this will not happen if the √4 is missed instead). Regarding the effect of diffraction spots vs diffraction rings, because our approach relies on the radial integration of the diffraction data, the 2D information is lost, yet the phase information contained in the diffraction spots is conserved when performing the 1D integration (see A. M. Seddon, G. Lotze, T. S. Plivelic and A. M. Squires, J. Am. Chem. Soc., 2011, 133, 13860–13863.). As with the response to the comment above, additional data processing within the synthetically generated data to account for such features within experimental datasets would be required.
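The ambiguity described in this response can be illustrated with a minimal sketch based on the characteristic peak-position ratios of the lamellar (1 : 2 : 3 : 4) and inverse hexagonal (1 : √3 : √4 : √7 : √9) phases. The function name `consistent_phases`, the 3% tolerance and the first-peak position are illustrative assumptions and are not taken from the authors' pipeline; the sketch also assumes the lowest observed peak is the first-order reflection.

```python
import math

# Allowed peak-position ratios q_n / q_1 for two common lipid phases.
LAMELLAR = [1.0, 2.0, 3.0, 4.0]
HEXAGONAL = [1.0, math.sqrt(3), 2.0, math.sqrt(7), 3.0]


def consistent_phases(peaks, tol=0.03):
    """Return the phases whose allowed ratios contain every observed ratio.

    Assumes the lowest-q peak is the first-order reflection (hypothetical helper).
    """
    q1 = min(peaks)
    ratios = [q / q1 for q in sorted(peaks)]
    candidates = {"lamellar": LAMELLAR, "hexagonal": HEXAGONAL}
    matches = []
    for name, allowed in candidates.items():
        # Every observed ratio must lie within a relative tolerance of an allowed one.
        if all(any(abs(r - a) <= tol * a for a in allowed) for r in ratios):
            matches.append(name)
    return matches


q1 = 0.10  # hypothetical first-peak position
full_hex = [q1, q1 * math.sqrt(3), q1 * 2]
missing_sqrt3 = [q1, q1 * 2]             # HII pattern with the sqrt(3) peak absent
missing_sqrt4 = [q1, q1 * math.sqrt(3)]  # HII pattern with the sqrt(4) peak absent

print(consistent_phases(full_hex))       # -> ['hexagonal']
print(consistent_phases(missing_sqrt3))  # -> ['lamellar', 'hexagonal']
print(consistent_phases(missing_sqrt4))  # -> ['hexagonal']
```

With the √3 reflection absent, the remaining peaks fit both ratio sets, whereas losing the √4 reflection leaves a pattern that only the hexagonal ratios can explain.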
Finally, the paper appears to be concerned with bulk phases (unsurprisingly as they are probably the most commonly measured). However, given the increased interest in lipid nanoparticles for drug delivery, many of these will fall into a Q range where the nanoparticle size and shape itself will contribute to the scattering - has this been accounted for (or will it not matter?).
By using the 2D feature map representation of the diffraction data, our approach highlights peak-like features while minimizing low prominence features, such as those described occurring in the Guinier region and Fourier regions of the SAXS spectra. In addition, our model has been optimized against hydrated stacks, where the correlation between the structural features is significantly higher than that of nanoparticle dispersions. We have not explicitly tested nanoparticles, however, as in our previous comment, our pipeline is modifiable, and we expect it to handle a variety of different scattering patterns including that of nanoparticles once it has been fine-tuned on sufficient amounts of such experimental data.

Referee: 2

Main comments:

I do not quite agree that "assigning the lipid phase observed from a SAXS pattern has been the limiting factor in many lipid researchers workflows", but I do see the true bottleneck instead in calculating electron density maps from SAXS data. However, since indexing is definitely the first step towards an AI-supported determination of electron density maps, your work might become an important puzzle piece on its way. Apart from not entirely sharing your vision, I have only a few comments for improvement of your very well written paper.

For appropriate understanding of data presented in figures 2 and 3 (as well as figures S2 and S3), please briefly describe and define properly PCA, t-SNE and UMAP.

Your reference list omits quite a lot of pioneering work, when it comes to the analysis of the lamellar, inverse hexagonal or bicontinuous cubic phases. Would be nice to see some of them cited in the reference list.

We thank the reviewer for their comments and insight and are glad that they think the paper is well written. We share the view that our work could be built into other applications and are hopeful it may enable such utility within the community. In light of the main comments from the reviewer, we have introduced the following to the manuscript:
“… to explore our datasets. These dimensionality reduction techniques allow the visualisation of multiple features within a large dataset on a 2D latent space. PCA is a linear model which changes the basis of the high dimensional data by maximising the variance. t-SNE and UMAP tend to perform better on non-linear data57,58. We therefore use all three techniques to be able to observe a representative view of the high dimensional SAXS data. …”
And:

“...The relative position of the Bragg peaks within a sample enables the identification of a lipid phase through characteristic Bragg peak ratios,17 and detailed analysis of the diffraction patterns allows to extract structural information of lamellar,18-20 cubic21-24 and hexagonal3,25-26 phases at molecular scale. Through the use of...”
This statement includes the additional citations numbered below.
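The description of PCA quoted above (a linear change of basis that maximises variance) can be sketched with NumPy alone. This is a minimal illustration on toy single-peak curves, not the authors' data exploration script; the helper name `pca_2d` and the toy data are assumptions. The non-linear methods are available separately as `sklearn.manifold.TSNE` in scikit-learn and `umap.UMAP` in umap-learn.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the two directions of largest variance."""
    Xc = X - X.mean(axis=0)  # centre each feature
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of total variance per axis
    return Xc @ Vt[:2].T, explained[:2]


# Toy stand-in for 1D SAXS curves: 200 samples x 50 q-bins, one Gaussian peak each.
rng = np.random.default_rng(0)
q = np.linspace(0.05, 0.5, 50)
centres = rng.uniform(0.1, 0.4, size=200)
X = np.exp(-((q[None, :] - centres[:, None]) ** 2) / (2 * 0.01**2))

embedding, explained = pca_2d(X)
print(embedding.shape)  # (200, 2)
```

Each row of `embedding` is one pattern placed in a 2D latent space, and `explained` reports how much of the total variance the two axes capture.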
Minor comments
Write "Holyst" throughout; most of the fonts of the figures in the SI need to be enlarged.
All figures within the SI have been edited in line with this comment. We thank the reviewer for pointing this out.
Referee: 3

Main comments

The authors provide the code for synthetic data generation and model training before fine tuning on experimental data in a publicly accessible GitHub repository. The code is clear and easy to follow, and the authors provide a helpful README to guide users through implementing it. However, the rest of their results and analyses cannot be reproduced because the data and code are not available.

I found the following elements to be missing:
- No experimental SAXS data is provided. This is important since the authors pre-train their model on synthetic data and then fine-tune it on experimental data.
- No code is provided for the “data exploration” sections (e.g. data analysis with t-SNE, UMAP and PCA).
We thank the reviewer for their comments and are pleased they were able to utilise our pipeline code from the GitHub repository. We are glad they find the code easy to follow and that our README file is clear. In light of the reviewer’s main comments, we have made significant edits to our GitHub repository, providing the full experimental dataset and adding a data visualization script (data_visualization.py) that performs dimensionality reduction on both synthetic and experimental data and plots:
• Raw SAXS pattern
• UMAP, t-SNE and PCA of synthetic and experimental data
• 2D feature map

Minor comments

Package versions differ between the manuscript and GitHub repository.
We have now changed the package versions in the manuscript to match those in the requirements.txt file on the GitHub repository.
In the SI, the authors state that Generate_Hexagonal.py should create 10000 samples. In my test, it only generated 2000.
This has now been fixed by changing the parameter linspace in the saxspy module.
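For context, `numpy.linspace(start, stop, num)` returns exactly `num` evenly spaced values (50 by default), so a generator that produces one pattern per grid point yields `num` samples. The sketch below is a hypothetical illustration of that relationship: `generate_patterns`, its parameter ranges and the single-Gaussian peak model are assumptions, not the saxspy implementation.

```python
import numpy as np


def generate_patterns(n_samples, d_min=40.0, d_max=80.0, n_q=100):
    """Hypothetical generator: one first-order Gaussian peak per repeat spacing."""
    spacing = np.linspace(d_min, d_max, num=n_samples)  # num sets the sample count
    q = np.linspace(0.01, 0.5, num=n_q)                 # shared q axis
    q_peak = 2 * np.pi / spacing                        # first-order peak, q = 2*pi/d
    return np.exp(-((q[None, :] - q_peak[:, None]) ** 2) / (2 * 0.005**2))


print(generate_patterns(2000).shape)   # (2000, 100)
print(generate_patterns(10000).shape)  # (10000, 100)
```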

The requirements.txt file contains a reference to the ‘skimage’ package, which cannot be installed. I believe skimage and scikit-image (listed earlier in the file) are the same, so the skimage requirement can be removed.
In the Generate_[...].py and Process_[…].py scripts, the np.save() command doesn’t work because the destination directories specified (‘Synthetic_raw’ and ‘Synthetic_Processed’) do not exist in the repo in its current state.

Empty directories “Synthetic_raw” and “Synthetic_Processed” have been added to the repository with a readme in each explaining what data should be saved there.
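An alternative to shipping empty directories is to create the destination at save time, since `np.save` does not create missing parent directories. The following is a minimal sketch of that pattern; the helper name `save_array` and the file name are illustrative assumptions and are not part of the repository.

```python
from pathlib import Path
import tempfile

import numpy as np


def save_array(directory, filename, array):
    """Create the destination directory if needed, then save the array."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename               # filename is expected to end in .npy
    np.save(out_path, array)
    return out_path


# Demonstration in a temporary location so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = save_array(Path(tmp) / "Synthetic_raw", "example.npy", np.zeros((4, 8)))
    loaded = np.load(path)

print(loaded.shape)  # (4, 8)
```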

Notes from reviewer checklist:
1a. Experimental SAXS data is not provided.
3b. The model is trained on real and/or synthetic data, so there are no standard features against which to compare.
6b. Although scripts to reproduce synthetic data generation are publicly accessible in the GitHub repo, scripts for further fine-tuning and data exploration are missing, as well as the experimental SAXS data on which the model is fine-tuned.
We have satisfied all of the above criteria with our edited GitHub repository and thank the reviewer for their careful review and replication of our pipeline.

References:
3 J. M. Seddon, BBA - Rev. Biomembr., 1990, 1031, 1–69.
17 A. I. I. Tyler, R. V. Law and J. M. Seddon, Methods Mol. Biol., 2015, 1232, 199–225.
18 M. C. Wiener, R. M. Suter and J. F. Nagle, Biophys. J., 1989, 55, 315–325.
19 J. F. Nagle and M. C. Wiener, Biophys. J., 1989, 55, 309–313.
20 R. Zhang, R. M. Suter and J. F. Nagle, Phys. Rev. E, 1994, 50, 5047–5060.
21 P. Garstecki and R. Hołyst, Langmuir, 2002, 18, 2529–2537.
22 P. Garstecki and R. Hołyst, Langmuir, 2002, 18, 2519–2528.
23 P. Garstecki and R. Hołyst, J. Chem. Phys., 2000, 113, 3772–3779.
24 E. Shyamsunder, S. M. Gruner, M. W. Tate, D. C. Turner, P. T. C. So and C. P. S. Tilcock, Biochemistry, 1988, 27, 2332–2336.
25 M. P. K. Frewein, M. Rumetshofer and G. Pabst, J. Appl. Crystallogr., 2019, 52, 403–414.
26 D. C. Turner and S. M. Gruner, Biochemistry, 1992, 31, 1340–1355.
31 J. M. Seddon, A. M. Squires, C. E. Conn, O. Ces, A. J. Heron, X. Mulet, G. C. Shearman, R. H. Templer, H. F. Gleeson, V. Percec, S. T. Lagerwall, P. Palffy-Muhoray and C. R. Safinya, Philos. Trans. R. Soc. A Math. Phys. Eng. Sci., 2006, 364, 2635–2655.
32 G. Benecke, W. Wagermaier, C. Li, M. Schwartzkopf, G. Flucke, R. Hoerth, I. Zizak, M. Burghammer, E. Metwalli, P. Müller-Buschbaum, M. Trebbin, S. Förster, O. Paris, S. V. Roth and P. Fratzl, J. Appl. Crystallogr., 2014, 47, 1797–1803.
33 M. Basham, J. Filik, M. T. Wharmby, P. C. Y. Chang, B. El Kassaby, M. Gerring, J. Aishima, K. Levik, B. C. A. Pulford, I. Sikharulidze, D. Sneddon, M. Webber, S. S. Dhesi, F. Maccherozzi, O. Svensson, S. Brockhauser, G. Náray and A. W. Ashton, J. Synchrotron Radiat., 2015, 22, 853–858.
34 S. Förster, L. Apostol and W. Bras, J. Appl. Crystallogr., 2010, 43, 639–646.
35 R. Dias de Castro, B. Renata Casadei, B. Vasconcelos Santana, M. Lotierzo, N. F. de Oliveira, B. Malheiros, P. Mariani, R. C. K. Kaminski and L. R. S. Barbosa, bioRxiv, 2019, 1–33.
36 J. S. Joseph, W. Liu, J. Kunken, T. M. Weiss, H. Tsuruta and V. Cherezov, Methods, 2011, 55, 342–349.
37 M. Dully, C. Brasnett, A. Djeghader, A. Seddon, J. Neilan, D. Murray, J. Butler, T. Soulimane and S. P. Hudson, J. Colloid Interface Sci., 2020, 573, 176–192.
57 L. van der Maaten and G. Hinton, J. Mach. Learn. Res., 2008, 9, 2579–2605.
58 L. McInnes, J. Healy and J. Melville, 2020, arXiv:1802.03426 [stat.ML].






Round 2

Revised manuscript submitted on 28 Jan 2022
 

28-Jan-2022

Dear Dr Gould:

Manuscript ID: DD-ART-10-2021-000025.R1
TITLE: Machine Learning Platform for Determining Experimental Lipid Phase Behaviour from Small Angle X-ray Scattering Patterns by Pre-training on Synthetic Data

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

Thank you for publishing with Digital Discovery, a journal published by the Royal Society of Chemistry – connecting the world of science to advance chemical knowledge for a better future.

With best wishes,

Professor Jason Hein
Associate Editor, Digital Discovery


******
******

Please contact the journal at digitaldiscovery@rsc.org





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.