From the journal Digital Discovery Peer review history

A machine learning approach toward generating the focused molecule library targeting CAG repeat DNA

Round 1

Manuscript submitted on 21 Aug 2023
 

28-Nov-2023

Dear Dr Nakatani:

Manuscript ID: DD-COM-08-2023-000160
TITLE: Generation of a focused molecule library by machine learning targeting CAG repeat DNA

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Nakatani et al. describe the use of a machine learning model to predict the label of a dataset containing compounds that target the CAG repeat DNA. Small molecule binding to CAG repeat DNA could lead to the discovery of drugs for various diseases like Huntington's disease and certain spinocerebellar ataxia. Although the topic is of interest to the drug discovery community, the manuscript has several issues that need to be fixed before being considered for publication.

1. The abstract is totally disconnected from the title (e.g., not even the target is mentioned) and should be improved.

2. The title is misleading. The authors did not generate a focused molecule library targeting the CAG repeat DNA. They just did a retrospective analysis, and used ML to classify binders from non-binders. In theory, the method could be used to select new compounds for testing, but this was not mentioned in the manuscript.
3. The computational methods section is poorly described and needs to be improved. Also, shouldn't the number of non-binders be 1896?

4. Why did the authors choose to use random forest? Were other estimators evaluated? XGBoost might be a good option for unbalanced datasets. Also, why haven't feature selection and hyperparameter optimization been attempted?

5. The authors state in the last paragraph of page 3 that removing the top 10 (or top 20) features leads to a slight decrease in all indices (Table S5), “suggesting that these top-ranked features influence the model performance”. This seems contradictory, I would expect that removing the top-ranked features would severely impact the performance of the model. Please explain.

6. On what basis did the authors reach the conclusion that they could enhance the probability of finding hits from 5% to 19.5%?

7. I think it would be of interest to the readers if the authors insert a figure with the projection of the labeled data using t-SNE or PCA. Also, it would be nice to have a table with the average values of common physico-chemical properties (MW, logP, PSA, etc) for the hits and non-hits.

Reviewer 2

The present manuscript describes the generation of small molecules targeting nucleotides using machine learning. Using surface plasmon resonance and fluorescence, the authors gathered a library of 2000 molecules and measured their respective binding responses for interacting with immobilised sequences of d(CAG)40 DNAs. The compounds were grouped into 2 classes based on their response units (RU): hits with RU > 20 (positive class) and non-hits with RU < 20 (negative class), leading to 104 hits and 1876 non-hits. Each compound was encoded using 5270 molecular descriptors as features (Appendix B). The labelled small molecules were used to train a machine-learning classifier with a random forest algorithm and a train-test ratio 8:2 (Figure S2). They handled the class imbalance by down-sampling non-hits or over-sampling hits (SMOTE technique). They used classical classification metrics (Figure 2) to monitor the performances of their models. The performances of RF classifiers with downsampling are illustrated in Figure S3 and Table S3; reducing the non-hits resulted in increased false positives (high recall, low precision). Their downsampled model was entry 4 with a mean ROC AUC of 0.81 (Figure 3). The performances of oversampled classifiers are shown in Figure S4 and Table S4, with their best model being entry 2. Finally, they identified the critical features of their best model in Figure 3, measured by the SHAP values and the Gini index from the RF algorithm. They concluded that molecular size, molecular complexity, symmetry and polarity were critical to identifying hits targeting CAG repeat DNAs.
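The pipeline summarised in this report (8:2 stratified split, down-sampling of the majority class, RF classification) can be sketched in a few lines of scikit-learn. The descriptor matrix, class ratio, and the 4:1 down-sampling ratio below are illustrative stand-ins, not the authors' actual data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the descriptor matrix: 2000 compounds x 50 features
# (the study uses 5270 molecular descriptors), with roughly 5% hits.
X = rng.normal(size=(2000, 50))
y = (rng.random(2000) < 0.05).astype(int)  # 1 = hit (RU > 20), 0 = non-hit

# 8:2 train-test split, stratified to preserve the imbalance in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Down-sample the non-hit class on the training set only (here to a 4:1 ratio)
hit_idx = np.where(y_tr == 1)[0]
nonhit_idx = rng.choice(np.where(y_tr == 0)[0], size=4 * len(hit_idx),
                        replace=False)
keep = np.concatenate([hit_idx, nonhit_idx])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr[keep], y_tr[keep])
print(classification_report(y_te, clf.predict(X_te)))
```

Resampling only the training set, as above, keeps the test set representative of the true 5% hit rate, which is what the precision/recall trade-off discussed in the reports is measured against.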

The manuscript is overall clear and well-written. The document illustrates the application of simple binary RF classifiers to distinguish between hits and non-hits interacting with DNA strings. Despite the research efforts to tackle class imbalance and the high-quality experimental data, further work is needed to improve the quality of the research outcomes and the utility of the model. Please consider the following remarks:

(1) Regression models? The authors have emphasised the need for more high-quality experimental data to build machine learning models. Several studies have used classifiers to counteract the noise of handling experimental data from different laboratories. The authors used high-quality binding response units from surface plasmon resonance and fluorescence to label their compounds in the present study. Why not build regression models with these experimental datasets? It would be very informative if the authors could provide the skewed distribution of RU values for the 2000 compounds, at least in supporting information.

(2) Only Random Forest (RF) algorithm. Recent studies have shown that tree-based algorithms like RF outperform deep learning algorithms for classification tasks using tabular data. The molecular descriptors used to describe the small molecules constitute tabular information. In this case, RF is an excellent choice (not the only one) to discriminate between hits and non-hits. Moreover, RF has a built-in feature importance, allowing an easy interpretation of the authors' model (Figure 4). Have the authors executed other classification algorithms (e.g. Decision tree, Gradient Boosting, XGBoost, Adaboost, SVM, Logistic regression, etc.) to demonstrate the superiority of RF?

(3) Other oversampling methods. ROSE and ADASYN are two other sampling methods. Can the authors compare their performances with SMOTE?

(4) External validation. It would be ideal if the authors could further support the utility of their ML-guided strategy by predicting hits from an external dataset of commercially available small molecules like ZINC and testing some for binding interactions with DNA probes.

Data review:

Are all data sources listed and publicly available?
The authors indicated the path to their data in the Supporting Information, but the data was not provided.

Are any potential biases in the source dataset reported and/or mitigated?
To be verified. The authors mentioned that most of their small molecules were non-hits, skewing the distribution of their training set.

Are baseline comparisons to simple/trivial models (for example, 1-nearest neighbour, random forest, most frequent class) provided?
The authors only used random forest for training their models, no comparative analysis.

Are baseline comparisons to current state-of-the-art provided?
None.

Does the data splitting procedure avoid data leakage (for example, is the same composition present in the training and test sets)?
The authors conducted a data split and mentioned a plausible bias in overlearning the non-hit class.

Is the code or workflow available in a public repository?
The authors provided the versions of the Python libraries, not their code.


 

Dear Editor,

Thank you for your evaluation; we appreciate the constructive comments from you and the reviewers to improve our ML-based studies. As described below, we believe we have responded to all the concerns raised by the reviewers and that the revised manuscript is now suitable for publication in Digital Discovery.

Response to Reviewers’ comments:

For Referee 1 comments,

Nakatani et al. describe the use of a machine learning model to predict the label of a dataset containing compounds that target the CAG repeat DNA. Small molecule binding to CAG repeat DNA could lead to the discovery of drugs for various diseases like Huntington's disease and certain spinocerebellar ataxia. Although the topic is of interest to the drug discovery community, the manuscript has several issues that need to be fixed before being considered for publication.

1. The abstract is totally disconnected from the title (e.g., not even the target is mentioned) and should be improved.
and
2. The title is misleading. The authors did not generate a focused molecule library targeting the CAG repeat DNA. They just did a retrospective analysis, and used ML to classify binders from non-binders. In theory, the method could be used to select new compounds for testing, but this was not mentioned in the manuscript.

According to the reviewer’s comments 1 and 2, we revised the title and abstract as shown below.
Title: Machine learning approach toward generating the focused molecule library targeting CAG repeat DNA
Abstract: This study reports a machine learning-based classification approach with surface plasmon resonance (SPR) labelled data to generate a focused molecule library targeting CAG repeat DNA. By using an SPR screening and a machine learning classification model, we can improve the identification process of elucidating new hit compounds for the next round of wet lab experiments. The reported model increased the probability of hits from 5.2% to 20.6% in a focused molecule library with 92.9% correct hit classification (recall) and 99.3% precision for the non-hit class.

3. The computational methods section is poorly described and needs to be improved.

We apologize for the short description of the computational methods in the manuscript. We originally submitted this manuscript to RSC Chemical Communications and therefore did not include a computational methods section. In response to the comment, a more detailed description has been added to the MATERIALS AND METHODS, 2. Machine learning section in the ESI. (page 14)

Also, shouldn't the number of non-binders be 1896?

We apologize for the typo. The number was corrected to 1896. (page 2, left column, line 18 from the bottom)

4. Why did the authors choose to use random forest? Were other estimators evaluated? XGBoost might be a good option for unbalanced datasets.

Thank you for the constructive comment. We chose a random forest classifier because it generally works well with small datasets. We apologize for the limited information on model selection and comparison. A brief description of the model selection was added at the end of the paragraph describing the computational methods in the main text. (page 2, left column, second and last paragraphs)

A comparison of several models would be beneficial for the study. We had briefly tried XGBoost previously but did not investigate it in detail. Following your suggestion, we revisited the XGBoost classification model and added the results and discussion to the revised version of the manuscript. (Table 1 and Fig. 3C) The following sentence was added to the main text (page 3, left column, last sentence of the first paragraph):

“Under these optimal conditions found for the RF model, an XGBoost model applied to the same training and testing dataset provided almost the same scores in the average, but a better result in the highest recall of 0.93 (cf. 0.86 for the RF classifier).”

Also, why haven't feature selection and hyperparameter optimization been attempted?

Regarding feature selection, we included the analysis of the essential features by Gini index and SHAP values, and tried to gain insight into the top descriptors via the beeswarm plots. We used all features for model training because we tried to avoid introducing human bias as much as possible and to let the RF algorithm select the essential features.
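The Gini-based ranking and the top-feature ablation discussed in this exchange can be sketched with scikit-learn's built-in importances; computing SHAP values would additionally need the third-party shap package. The data below are synthetic stand-ins for the 5270-descriptor matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data; the actual study uses 5270 molecular descriptors
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Rank features by the RF's built-in Gini (mean decrease in impurity) importance
order = np.argsort(clf.feature_importances_)[::-1]
top10 = order[:10]

# Ablation probe in the spirit of the authors' Table S5: retrain with the
# top-10 features removed and compare test accuracy
mask = np.ones(X.shape[1], dtype=bool)
mask[top10] = False
clf_ablate = RandomForestClassifier(random_state=0).fit(X_tr[:, mask], y_tr)
print(clf.score(X_te, y_te), clf_ablate.score(X_te, y_te) if False else
      clf_ablate.score(X_te[:, mask], y_te))
```

With correlated descriptors, the ablated model often loses little accuracy, which is the redundancy argument the authors make.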

We attempted a simple grid search for hyperparameter tuning, although this part of the code was not included in the uploaded script (covering the number and size of the trees: n_estimators = [50, 100, 300, 500], max_depth = [5, 7, 10, 20, 30]).
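A grid search over exactly those two grids can be reproduced with scikit-learn's GridSearchCV; the dataset, scoring metric, and fold count below are illustrative assumptions, as the response specifies only the parameter grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative data; only the parameter grids come from the authors' response
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"n_estimators": [50, 100, 300, 500],
              "max_depth": [5, 7, 10, 20, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="recall", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```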

5. The authors state in the last paragraph of page 3 that removing the top 10 (or top 20) features leads to a slight decrease in all indices (Table S5), “suggesting that these top-ranked features influence the model performance”. This seems contradictory, I would expect that removing the top-ranked features would severely impact the performance of the model. Please explain.

Thank you for your question and for letting us explain our thoughts. We initially expected the same outcome as the reviewer, although the result was not what we expected. With these results in hand, we discussed why removing the top-ranked features did not have a significant impact on the performance. For simple models, it seems reasonable that removing top-ranked features can significantly impair the performance. In our study, however, a complex set of more than 5,000 molecular descriptors was used, and in such a set there can be overlaps or similarities among features. We speculated that the information encapsulated by the removed top-ranked features might be redundant and can be captured by a combination of the remaining features. This observation was further supported by the newly added two-dimensional UMAP visualization presented in Fig. 5 in the main text: no obvious separation between the hits and non-hits is observed, indicating that the classification boundary might not depend on a few essential features. Based on the above discussion, we revised and added the sentences described below.

“suggesting that these top-ranked features influence the model performance only weakly.” (“only weakly” was added) (page 4, right column, line 1)

“we speculated that the information encapsulated by the removed top-ranked features might be redundant and can be captured by a combination of the remaining features.” (page 4, right column, line 2)

“Finally, we used UMAP (Uniform Manifold Approximation and Projection)38 for dimensionality reduction to illustrate a spatial distribution of hit compounds with the top 20 features, where the hit compounds are represented as red dots (this should be in figure caption), showing that hit compounds were somewhat clustered towards the right side, but the observed pattern did not show a distinct separation of the clusters of hit from the non-hit compounds. These results supported the observation that the impact of the removal of top-ranked features was not significant in our studies, and also suggested that there are possibilities to improve classification by adjusting the labelling method and including some new molecular features. The current labelling method is only focused on the response strength and, therefore, we may fail to capture other important features such as the signal shape representing the binding thermodynamics and kinetics. The exploration of new molecular features would be of particular interest, although such features may depend on the target.” (newly added with Figure 5) (page 4, right column, second paragraph)

6. On what basis did the authors reach the conclusion that they could enhance the probability of finding hits from 5% to 19.5%?

We apologize for the unclear explanation in the original manuscript. The number was calculated based on the 104 hit compounds of the SPR experiments (5.2%, 104/2000 compounds) and the true hits obtained by the RF classifier (19.5%, 24/123 compounds). In the revised manuscript, we calculated the number based on the true hits obtained by the XGBoost classifier, giving 20.6% (26/126 compounds). (Fig. 3C, highest-recall matrix) We focused on the highest recall rather than the highest precision because, in the screening process, finding as many true hits as possible is most important to increase the success rate in drug discovery.

Therefore, in the revised manuscript, we explicitly described the conclusion as follows:

“Theoretically, it is possible to enhance the probability of hits from 5.2% (104/2000 compounds by SPR experiments) in an original molecule library to 20.6% (26/126 compounds) in the hit class obtained in XGBoost classification, which represents the focused library.”
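The enrichment figures quoted in this response follow from trivial arithmetic on the counts given (104 SPR hits out of 2000 compounds; 26 true hits among 126 compounds predicted as hits):

```python
# Counts taken directly from the authors' response
hits_total, library = 104, 2000
true_hits, predicted_hits = 26, 126

baseline = hits_total / library        # hit probability in the full library
enriched = true_hits / predicted_hits  # hit probability in the focused library

print(f"{baseline:.1%} -> {enriched:.1%}")  # 5.2% -> 20.6%
```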

7. I think it would be of interest to the readers if the authors insert a figure with the projection of the labeled data using t-SNE or PCA. Also, it would be nice to have a table with the average values of common physico-chemical properties (MW, logP, PSA, etc) for the hits and non-hits.

Thank you for the comment. We added a plot and description of the top-20-feature visualization, using UMAP to reduce the feature dimensions, to the main text as Fig. 5. (page 4, right column) We preferred UMAP to t-SNE and PCA, as UMAP has recently become a popular nonlinear visualization method. A plot of selected constitutional and physicochemical properties for the hits and non-hits was added to the SI as Fig. S10 and S11, respectively. (supporting information, page 9)
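A two-dimensional projection of labelled data, as requested by the reviewer, follows the same pattern whichever method is chosen. UMAP needs the third-party umap-learn package, so the sketch below uses PCA (one of the reviewer's suggestions) on synthetic stand-in data; the plotting lines are commented out to keep it self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Illustrative labelled data standing in for the descriptor matrix (~5% hits);
# swapping PCA for umap.UMAP(n_components=2) would give the manuscript's view
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
coords = PCA(n_components=2).fit_transform(X)

# Plotting sketch: hits (y == 1) as red dots over grey non-hits
# import matplotlib.pyplot as plt
# plt.scatter(*coords[y == 0].T, c="lightgrey", s=5, label="non-hit")
# plt.scatter(*coords[y == 1].T, c="red", s=10, label="hit")
print(coords.shape)
```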

Referee: 2

Comments to the Author
The present manuscript describes the generation of small molecules targeting nucleotides using machine learning. Using surface plasmon resonance and fluorescence, the authors gathered a library of 2000 molecules and measured their respective binding responses for interacting with immobilised sequences of d(CAG)40 DNAs. The compounds were grouped into 2 classes based on their response units (RU): hits with RU > 20 (positive class) and non-hits with RU < 20 (negative class), leading to 104 hits and 1876 non-hits. Each compound was encoded using 5270 molecular descriptors as features (Appendix B). The labelled small molecules were used to train a machine-learning classifier with a random forest algorithm and a train-test ratio 8:2 (Figure S2). They handled the class imbalance by down-sampling non-hits or over-sampling hits (SMOTE technique). They used classical classification metrics (Figure 2) to monitor the performances of their models. The performances of RF classifiers with downsampling are illustrated in Figure S3 and Table S3; reducing the non-hits resulted in increased false positives (high recall, low precision). Their downsampled model was entry 4 with a mean ROC AUC of 0.81 (Figure 3). The performances of oversampled classifiers are shown in Figure S4 and Table S4, with their best model being entry 2. Finally, they identified the critical features of their best model in Figure 3, measured by the SHAP values and the Gini index from the RF algorithm. They concluded that molecular size, molecular complexity, symmetry and polarity were critical to identifying hits targeting CAG repeat DNAs.

The manuscript is overall clear and well-written. The document illustrates the application of simple binary RF classifiers to distinguish between hits and non-hits interacting with DNA strings. Despite the research efforts to tackle class imbalance and the high-quality experimental data, further work is needed to improve the quality of the research outcomes and the utility of the model. Please consider the following remarks:

(1) Regression models? The authors have emphasized the need for more high-quality experimental data to build machine learning models. Several studies have used classifiers to counteract the noise of handling experimental data from different laboratories. The authors used high-quality binding response units from surface plasmon resonance and fluorescence to label their compounds in the present study. Why not build regression models with these experimental datasets? It would be very informative if the authors could provide the skewed distribution of RU values for the 2000 compounds, at least in supporting information.

Thank you for commenting on a regression model.
We trained a simple MLP regression model on the training dataset, tested it on the test dataset, and plotted the predicted RU values against the ground-truth RU values. The current results did not show an advantage over the classification model, but we would like to consider regression models further in future studies.

The skewed distribution of RU values was added to the ESI as Fig. S2; most of the responses are located around zero. A sentence mentioning this plot was added to the main text. (page 2, left column, second paragraph, line 9)
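The regression experiment the authors describe can be sketched with scikit-learn's MLPRegressor. The RU targets below are synthetic, constructed only to mimic the skew they mention (most responses near zero, a small fraction of large values); architecture and iteration count are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 compounds, RU mostly ~0 with ~10% larger responses
X = rng.normal(size=(2000, 30))
ru = np.abs(rng.normal(scale=5, size=2000)) * (rng.random(2000) < 0.1) * 10

X_tr, X_te, y_tr, y_te = train_test_split(X, ru, test_size=0.2, random_state=0)
reg = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr)
pred = reg.predict(X_te)
# A scatter of pred vs y_te would give the predicted-vs-ground-truth plot
print(pred.shape)
```

A heavily skewed target like this is exactly why regression can underperform here: the model is dominated by the near-zero mass, which is consistent with the authors' observation.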

(2) Only Random Forest (RF) algorithm. Recent studies have shown that tree-based algorithms like RF outperform deep learning algorithms for classification tasks using tabular data. The molecular descriptors used to describe the small molecules constitute tabular information. In this case, RF is an excellent choice (not the only one) to discriminate between hits and non-hits. Moreover, RF has a built-in feature importance, allowing an easy interpretation of the authors' model (Figure 4). Have the authors executed other classification algorithms (e.g. Decision tree, Gradient Boosting, XGBoost, Adaboost, SVM, Logistic regression, etc.) to demonstrate the superiority of RF?

Thank you for this comment. Reviewer 1 made the same suggestion; please see our response to Reviewer 1, comment 4.

(3) Other oversampling methods. ROSE and ADASYN are two other sampling methods. Can the authors compare their performances with SMOTE?

Thank you for the comments. We considered that these methods might not significantly enhance the performance while requiring relatively longer computation time and cost. We agree that comparisons among different and new data augmentation techniques would be important.
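SMOTE, ROSE, and ADASYN are all available through the imbalanced-learn package; to make concrete what such a comparison would vary, the sketch below implements only the core SMOTE idea (interpolating between a minority-class point and one of its nearest minority neighbours) in plain numpy and scikit-learn. Function name and data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 5))  # illustrative hit-class descriptors

def smote_like(X, n_new, k=5, rng=rng):
    """Generate synthetic minority samples by interpolating from a random
    point toward one of its k nearest neighbours (SMOTE's core idea).
    ADASYN differs mainly by biasing which base points are chosen."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    base = rng.integers(0, len(X), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # col 0 is the point itself
    lam = rng.random((n_new, 1))
    return X[base] + lam * (X[neigh] - X[base])

synthetic = smote_like(minority, n_new=80)
print(synthetic.shape)
```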

(4) External validation. It would be ideal if the authors could further support the utility of their ML-guided strategy by predicting hits from an external dataset of commercially available small molecules like ZINC and testing some for binding interactions with DNA probes.

Thank you for this important comment. We are currently applying the reported ML method to a large chemical library of commercially available small molecules to further validate its usefulness. As validation by SPR studies is the rate-limiting experiment, we hope to include these results in our next publication.

For data checklist,
The dataset and scripts are available in a public repository as follows:
https://github.com/chen26sanken/Machine-learning-approach-toward-generating-the-focused-molecule-library-targeting-CAG-repeat-DNA

The description has been updated in the README.md.




Round 2

Revised manuscript submitted on 14 Dec 2023
 

29-Dec-2023

Dear Dr Nakatani:

Manuscript ID: DD-COM-08-2023-000160.R1
TITLE: Generation of a focused molecule library by machine learning targeting CAG repeat DNA

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 2

The authors have addressed most of the comments made by both reviewers. The use of regression models would have been more appropriate and in line with the experimental design. The comparison between state-of-the-art classification algorithms was partially answered. The authors chose to address some points in future studies.

Reviewer 1

The authors' revisions have effectively addressed my concerns and improved the clarity and significance of their work. Therefore, I believe this revised version is ready for publication.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.