From the journal Digital Discovery Peer review history

Predicting small molecules solubility on endpoint devices using deep ensemble neural networks

Round 1

Manuscript submitted on 03 Nov 2023
 

05-Feb-2024

Dear Dr White:

Manuscript ID: DD-ART-11-2023-000217
TITLE: Predicting small molecules solubility on endpoint devices using deep ensemble neural networks

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

(I apologize for the long delay in reviewing: We had one reviewer ask for two months of extensions...only to never submit the review.)

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The authors give a nice presentation of their new solubility prediction model. Comparisons with other baseline or benchmark models provided sufficient information. However, a more elaborate discussion of how their models compare to descriptor-based QSAR models and NLP-based Transformer models is needed, covering not just prediction accuracy but also how these models solve the solvation or solubility prediction problem and how they explain their predictions. In addition to this major issue, a few minor issues should be fixed.

Major issues:
1. The authors should be more careful in discussing the advantages of their models over Transformer models. Since Transformers are now available in open-source frameworks like Hugging Face, it is easy to implement a new Transformer or use a pre-trained Transformer for fine-tuning or direct prediction. Therefore, the authors should detail the reasons why they chose an LSTM or RNN-based architecture over the Transformer. Also, as far as I know, LSTMs and RNNs are slow to train, whereas Transformers are easier because they can process multiple words at once rather than sequentially.
2. The authors could further discuss the advantages of their model over models using molecular descriptors. First of all, some types of molecular descriptors are very slow to compute, probably because the software for the algorithm is too old and cannot be accelerated by GPU. I have tried PaDEL before and it took 12 hours to complete a descriptor calculation for about 5,000 molecules. This makes PaDEL impossible to use on an AI-based web platform. Another problem with descriptors is their interpretability. While it is easy to trace a descriptor back from a prediction in a QSAR model, it is difficult to further explain some complex descriptors. Some descriptors are difficult to understand by people outside the field of computational chemistry. On the other hand, deep learning models such as GNN and Transformers can directly explain their predictions by highlighting their feature maps or attention layers. In the field of predictive chemistry, many Transformer-based models have succeeded in explaining their prediction through the heatmap of attention layers.

Minor issues:
1. I recommend that authors update their references to the peer-reviewed version of the arXiv papers whenever possible. For example, reference 91 has been published in "Digital Discovery". Also, reference 63 has been published in "Machine Learning: Science and Technology" but with a different title.
2. As Digital Discovery is mostly a chemistry journal, it is uncommon to have a section of “related works” which is more frequently seen in computer science conference papers. I recommend the authors merge this section with the introduction. The introduction of the works not directly related to this paper can be significantly more concise.
3. More instructions should be provided on the website. After I enter the molecule, I do not know how to start prediction. Pressing “enter” does not work. It would be better to have a “start prediction” clickable button.
4. The concluding section is too short; it should summarize the strengths of the model in more detail and present future plans in terms of scientific research. With a more detailed discussion of the above two major issues, I believe the authors should be able to expand their conclusions.

Reviewer 2

The authors present results from a deep learning modeling effort to predict solubility and run it through a static website. Despite the many publications that report using machine learning to predict solubility, it remains an interesting area in which to compare different modeling approaches, since the data for modeling is available. However, the authors should have referenced many of the previous efforts and at least attempted to distinguish their work from that of other researchers.

1. Page 1, paragraph 2: reference 24 was published in 2015, and many accurate solubility prediction models have been published since then.
2. There is a similar work to this study: https://doi.org/10.1186/s13321-021-00575-3
3. There is a recently published paper that answered the solubility challenge from the Llinas study: https://doi.org/10.1186/s13321-023-00752-6
From that paper, some molecules from Table 5 were tested on the authors' website, "mol.dev".
Here are the results:

Hexobarbital : Intrinsic Solubility: -2.67 ( mol.dev : -5.76 ± 2.51)
Nalidixic_acid: −3.61 (mol.dev : -6.30 ± 3.30)
Phenanthroline: −1.61 (mol.dev : -6.91 ± 3.26)
Phenobarbital: −2.29 (mol.dev : -6.80 ± 2.72)

Also from Table S5 of the nominated paper:
Acebutolol: -2.67 (mol.dev : -7.95 ± 3.88)
Amoxicillin: -2.03 (mol.dev : -8.35 ± 3.07)
Bendroflumethiazide: -3.89 (mol.dev : -9.39 ± 2.93)
Benzocaine: -2.41 (mol.dev : -6.28 ± 2.40)
There are very big differences between the results from the authors' model and the real values. Also, the uncertainty is almost half the magnitude of the actual values, which is not acceptable.

4. The authors have not made it clear what the advantage of this new study is, and should do so.
Here are a few published studies that use deep learning models with descriptors:
https://doi.org/10.1186/s13321-021-00575-3
https://doi.org/10.1021/acsomega.1c01035
https://doi.org/10.1093/bib/bbab291
https://doi.org/10.1016/j.isci.2020.101961

5. Page 2, paragraph 4: QSPR and MLR are not similar methods that can be compared. QSPR (Quantitative Structure–Property Relationships) is an approach that can be used in modeling methods such as descriptor-based and group-contribution models, whereas MLR is an algorithm for solving the problem.
6. Section 3.1: The authors stated that they used 9,982 molecules from AqSolDB, but mentioned that they augmented the data to 96,625 entries. First, the authors need to be more specific about the method used to generate randomized SMILES strings. Second, the database needs to be shared, even as an additional file, or the GitHub repository should be updated. Lastly, more characterization of the utilized data is required.

7. Page 6, first paragraph: the kde4 LSTM aug model achieved 1.049, 1.054, and 1.340. There are no such values in Table 1, which was referred to.
8. Page 6, Section 4.3: The authors cited the Lakshminarayanan study, which used ensemble sizes of 1, 5, and 10. What was the reason for choosing ensemble sizes of 4 and 10 in this study? As mentioned in the Lakshminarayanan study, the ensemble weights need to be optimized.
9. The authors need to state the accuracy of the model by itself. The results in Tables 1 and 2 count as the validation part of the model.
10. The largest outliers should be analyzed.
11. The authors should use a consistent number of significant figures (e.g., 2 decimal places) when reporting r², MAE, RMSE, etc.
12. Table 1: The RF model was better for solubility challenges 2-1 and 2-2. Please comment on possible reasons.
13. Section 3.1: The authors said they employed SMILES strings for their model. It should be made clear which model was used to convert the strings into input values. In addition, as mentioned in the "conclusion" section, they predict logS directly from SMILES. However, it is not clear why a descriptor-based method was used in the "results" section. If this method was used to compare RF with DNN/kde, why was RF not modeled on the SMILES strings instead of descriptors?

The authors mentioned that each entry of AqSolDB was used to generate ten new unique SMILES strings. These ten SMILES per entry cannot be seen in the uploaded database. The database with 96,625 molecules needs to be shared, even as an additional file, or the GitHub repository should be updated.

Reviewer 3

The manuscript presents a novel approach to predicting small molecule solubility using deep ensemble neural networks, focusing on the use of in-browser machine learning models. This study is significant due to its potential impact on the accessibility and usability of machine learning models in chemistry. The authors have developed a deep ensemble model that quantifies uncertainty, enhancing the reliability of predictions. The approach of running the model on a static website is innovative, potentially reducing implementation costs and improving accessibility. I recommend it for publication after addressing the following minor comments.

- The field of molecular solubility prediction is highly active, with numerous relevant studies existing. Enhancing the paper with a more comprehensive comparison to these existing models and a deeper exploration of the limitations of the proposed model would be advantageous. It would be helpful for readers if a table summarizing these comparisons and analyses were included in the Supplementary Information.

- Similarly, the superiority of the proposed method requires a more in-depth comparison. This could be achieved by more thoroughly analyzing and contrasting the method with those outlined in previous studies.


 

Dear Dr. Schrier,

Thanks for processing our submission and sending the review comments.
We apologize for the three-week delay.
We carefully considered each reviewer's report and addressed the issues below.
A PDF with the same content but better formatting, containing all the reviewers' reports and our answers, was uploaded with the updated main text.

Best regards,
Dr. Mayk Caldas
Postdoctoral associate, University of Rochester


################ Reviewers' response ################


# Referee 1 #

Comments to the Author
The authors give a nice presentation of their new solubility prediction model. Comparisons with other baseline or benchmark models provided sufficient information. However, a more elaborate discussion of how their models compare to descriptor-based QSAR models and NLP-based Transformer models is needed, covering not just prediction accuracy but also how these models solve the solvation or solubility prediction problem and how they explain their predictions. In addition to this major issue, a few minor issues should be fixed.

Answer: We thank the referee for the comments. We carefully read the suggestions and changed the manuscript accordingly. All of the reviewer's comments are addressed below.

## Major issues ##

1- The authors should be more careful in discussing the advantages of their models over Transformer models. Since Transformers are now available in open-source frameworks like Hugging Face, it is easy to implement a new Transformer or use a pre-trained Transformer for fine-tuning or direct prediction. Therefore, the authors should detail the reasons why they chose an LSTM or RNN-based architecture over the Transformer. Also, as far as I know, LSTMs and RNNs are slow to train, whereas Transformers are easier because they can process multiple words at once rather than sequentially.

Answer: Transformers show great performance on most applied tasks. The referee is correct that transformers are faster to train because positional embeddings let them process multiple tokens at once. However, our main concern in this project is inference time. Transformers are large models, so developing an ensemble of transformers and hosting all of their parameters on the end device is impractical given its limited storage and computing power. In addition, because ours is a deep-ensemble model, the dependence on model size is even more critical. We added a paragraph to the introduction justifying our choice of architecture.
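To make the storage argument concrete, here is a back-of-the-envelope sketch in Python. The hyperparameters and the 110M-parameter transformer reference size are illustrative assumptions, not the paper's exact values:

```python
def lstm_params(vocab, embed, hidden):
    """Rough parameter count for a small recurrent regressor:
    embedding table + one LSTM layer (4 gates, each with input,
    recurrent, and bias weights) + a 2-output head (mean, std)."""
    embedding = vocab * embed
    lstm = 4 * (hidden * (embed + hidden) + hidden)
    head = hidden * 2 + 2
    return embedding + lstm + head

# Illustrative sizes only (not the paper's exact hyperparameters)
per_model = lstm_params(vocab=273, embed=64, hidden=128)

# float32 footprint of a 10-member ensemble, in MB
lstm_ensemble_mb = 10 * per_model * 4 / 1e6   # a few MB
bert_ensemble_mb = 10 * 110e6 * 4 / 1e6       # thousands of MB for BERT-base-sized models
```

Even with generous hyperparameters, a 10-member LSTM ensemble stays in the megabyte range, while an ensemble of BERT-base-sized transformers would require gigabytes, which a browser on an endpoint device cannot reasonably download or hold in memory.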

2- The authors could further discuss the advantages of their model over models using molecular descriptors. First of all, some types of molecular descriptors are very slow to compute, probably because the software for the algorithm is too old and cannot be accelerated by GPU. I have tried PaDEL before and it took 12 hours to complete a descriptor calculation for about 5,000 molecules. This makes PaDEL impossible to use on an AI-based web platform. Another problem with descriptors is their interpretability. While it is easy to trace a descriptor back from a prediction in a QSAR model, it is difficult to further explain some complex descriptors. Some descriptors are difficult to understand by people outside the field of computational chemistry. On the other hand, deep learning models such as GNN and Transformers can directly explain their predictions by highlighting their feature maps or attention layers. In the field of predictive chemistry, many Transformer-based models have succeeded in explaining their prediction through the heatmap of attention layers.

Answer: One of our baselines is a random forest that uses PaDEL descriptors. It took around 20 hours to compute descriptors for all molecules in AqSolDB on our machines. We included this argument to further motivate our approach.
Explaining deep learning model predictions is a complex task. In this study, we were not interested in explaining the predictions or performing any explainable artificial intelligence (XAI) analysis, which we consider out of scope for this project. However, we plan to add this feature to mol.dev in the future, since our lab has other research lines focused on XAI.

## Minor issues ##

1- I recommend that authors update their references to the peer-reviewed version of the arXiv papers whenever possible. For example, reference 91 has been published in "Digital Discovery". Also, reference 63 has been published in "Machine Learning: Science and Technology" but with a different title.

Answer: Thank you for catching this error. We have carefully gone over the references section and corrected the mistakes.

2- As Digital Discovery is mostly a chemistry journal, it is uncommon to have a section of “related works” which is more frequently seen in computer science conference papers. I recommend the authors merge this section with the introduction. The introduction of the works not directly related to this paper can be significantly more concise.

Answer: We merged the "Introduction" and "Related Works" sections and revised the merged text to fit the introduction's purpose and to be concise.

3- More instructions should be provided on the website. After I enter the molecule, I do not know how to start prediction. Pressing “enter” does not work. It would be better to have a “start prediction” clickable button.

Answer: The calculation is triggered by changes in either of the two text boxes below the "Enter Molecule" field. We are unsure how the referee entered the input, but either typing or pasting a sequence automatically triggers the prediction; no "enter" key or button is needed. We suggest clicking outside the text box after entering the molecule's SMILES. Lastly, refreshing the page ensures the website has loaded correctly.

4- The concluding section is too short; it should summarize the strengths of the model in more detail and present future plans in terms of scientific research. With a more detailed discussion of the above two major issues, I believe the authors should be able to expand their conclusions.

Answer: We agree with the referee that the conclusion could be more informative. The conclusions were completely rewritten to discuss our research strengths and future perspectives.



# Referee 2 #

## Comments to the Author ##
The authors present results from a deep learning modeling effort to predict solubility and run it through a static website. Despite the many publications that report using machine learning to predict solubility, it remains an interesting area in which to compare different modeling approaches, since the data for modeling is available. However, the authors should have referenced many of the previous efforts and at least attempted to distinguish their work from that of other researchers.

Answer: We thank the referee for the very detailed review of our work. We understand that solubility prediction is a very active area, with many approaches and models already developed and more being developed continuously. However, SOTA models perform similarly to a simple random forest trained on molecular descriptors. Therefore, we included a discussion of feature-based models in our "Related Works" section and focused the rest of the paper on models that can extract properties directly from SMILES. The meaning of "directly from SMILES" is also raised by the referee and is discussed below, together with the other specific issues.

## Review ##

1- Page 1, paragraph 2: reference 24 was published in 2015, and many accurate solubility prediction models have been published since then.

Answer: Despite reference 24 (updated to Ref 26 in the new version of the manuscript) being published in 2015, the statement is still valid. Developing accurate, robust models for solubility prediction is still a significant challenge. We added a recent paper that discusses similar problems to support the argument.

2- There is a similar work to this study: https://doi.org/10.1186/s13321-021-00575-3

Answer: This issue will be answered together with item 4 since they are overlapping.

3- There is a recently published paper that answered the solubility challenge from the Llinas study: https://doi.org/10.1186/s13321-023-00752-6
From that paper, some molecules from Table 5 were tested on the authors' website, "mol.dev".
Here are the results:

Molecule             Intrinsic solubility   mol.dev (old)    mol.dev (new)
Hexobarbital         -2.67                  -5.76 ± 2.51     -2.94 ± 1.73
Nalidixic acid       -3.61                  -6.30 ± 3.30     -2.45 ± 1.67
Phenanthroline       -1.61                  -6.91 ± 3.26     -4.29 ± 2.53
Phenobarbital        -2.29                  -6.80 ± 2.72     -2.55 ± 2.02

Also from Table S5 of the nominated paper:

Acebutolol           -2.67                  -7.95 ± 3.88     -3.74 ± 1.66
Amoxicillin          -2.03                  -8.35 ± 3.07     -2.50 ± 1.76
Bendroflumethiazide  -3.89                  -9.39 ± 2.93     -4.86 ± 1.87
Benzocaine           -2.41                  -6.28 ± 2.40     -2.41 ± 1.74

There are very big differences between the results from the authors' model and the real values. Also, the uncertainty is almost half the magnitude of the actual values, which is not acceptable.

Answer: The cited paper from Tayyebi et al. employs an RF approach based on descriptors/fingerprints. Using molecular descriptors, they report an MAE of 1.2, while Morgan fingerprints achieved an MAE of 0.64 on Llinas' first challenge (Table S5). Our baseline RF achieved an MAE of 0.914 using PaDEL descriptors, while our deep ensemble model achieved an MAE of 0.843. Tayyebi et al.'s results with Morgan fingerprints were added to our discussion.
Regarding the table provided by the referee, we would like to thank you for catching this error.
The values displayed in the paper were calculated using the model implemented in the website's back end, for easier programmatic access.
However, when the model was deployed, the website was not reading the vocabulary file correctly, so the model tokenized the SELFIES sequence incorrectly, leading to wrong predictions. We fixed that bug, and the predicted values are now in accordance with the back-end model.
The table above shows the new values calculated after this hotfix.

4- The authors have not made it clear what the advantage of this new study is, and should do so. Here are a few published studies that use deep learning models with descriptors:
4.1 https://doi.org/10.1186/s13321-021-00575-3
4.2 https://doi.org/10.1021/acsomega.1c01035
4.3 https://doi.org/10.1093/bib/bbab291
4.4 https://doi.org/10.1016/j.isci.2020.101961

Answer: Here are summaries of each paper pointed out by the referee:
4.1 The work of Ye and Ouyang compared a gradient-boosting machine (lightGBM) with several machine learning algorithms for predicting the solubility of solutes in different solvents and at different temperatures. Their lightGBM model used ECFP fingerprints and, as seen in their Table 4, achieved an MSE of 0.7511 on unseen solutes.
4.2 Kurotani et al. developed a model that receives analytical data (NMR information, refractive index, and density) to predict solubility parameters, which are different properties. Their models performed similarly (RMSE of ~0.6). Kurotani et al. also developed a web application for their model; on their website, one can see the domain-specific data needed to perform a prediction.
4.3 Zagidullin et al. investigated the impact of the molecular representation on the prediction. They considered 11 variants of molecular fingerprints and trained decision trees as the regression model. They show that carefully selecting which fingerprint to use is a crucial step of the project. Their best results achieved an RMSE of ~0.73.
4.4 Sorkun et al. evaluated data selection's impacts on training their consensus model. They selected subsets of their previously published AqSolDB following different standard deviations among experimental solubility measurements. They showed that the quality of the data is crucial for the model. In addition, they developed a consensus model (similar to an ensemble but composed of different model algorithms) that can outperform the SOTA models.

Analysis: Comparatively, our model requires neither descriptor nor fingerprint calculation. We achieve performance comparable to SOTA models using only string representations (a much simpler and less computationally intensive input). On our website, the user can simply copy and paste this representation instead of supplying domain-specific information.
The work of Zagidullin et al. shows how extracting information from string representations is useful and can avoid a very complex representation design.
In addition, our model is much easier to use than the work of Kurotani et al., highlighting the main strength of our study.
Indeed, we could not outperform Sorkun's consensus model, but that was never our goal. We hypothesize that: (1) we can implement an accurate solubility model that uses a more straightforward data representation; and (2) because our representation is simple and the model does not require large computing power, we can host it on a static website to improve usability.
We validated our model using Llinas' challenge datasets (see item 9). The studies above were excluded from our comparative analysis with the literature because they did not evaluate their models on Llinas' datasets.

5- Page 2, paragraph 4: QSPR and MLR are not similar methods that can be compared. QSPR (Quantitative Structure–Property Relationships) is an approach that can be used in modeling methods such as descriptor-based and group-contribution models, whereas MLR is an algorithm for solving the problem.

Answer: We are not comparing QSPR and MLR. The referred paragraph lists different approaches commonly used to predict solubility; QSPR and MLR are the most common approaches for this task, as shown by Llinas et al. in their solubility challenge.
We changed the sentence "Historically, common approaches computed aqueous solubilities based on QSPR and MLR methods." to "Historically, common approaches to computing aqueous solubilities used methods based on QSPR or MLR."

6- Section 3.1: The authors stated that they used 9,982 molecules from AqSolDB, but mentioned that they augmented the data to 96,625 entries. First, the authors need to be more specific about the method used to generate randomized SMILES strings. Second, the database needs to be shared, even as an additional file, or the GitHub repository should be updated. Lastly, more characterization of the utilized data is required.

Answer: SMILES randomization is a technique for generating non-canonical SMILES: the atoms are numbered differently, resulting in different but equivalent SMILES strings for the same molecule. We cited the relevant literature describing the method.
The AqSolDB dataset and the code used to augment it are available. To address the referee's comment, we also included the augmented dataset in data.zip.
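To illustrate the randomization idea, here is a toy, stdlib-only sketch (in practice one would use RDKit, e.g. Chem.MolToSmiles(mol, doRandom=True)): a depth-first walk from a random starting atom with a random neighbour order yields different but equivalent SMILES for the same molecule.

```python
import random

def random_smiles(atoms, bonds, seed=None):
    """Emit one SMILES string for a simple acyclic, single-bonded
    molecule by a depth-first walk from a random starting atom with
    a random neighbour order -- the idea behind randomized
    (non-canonical) SMILES. Toy sketch: no rings, charges, or
    bond orders."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    def walk(i, parent):
        children = [j for j in adj[i] if j != parent]
        rng.shuffle(children)
        out = atoms[i]
        for k, j in enumerate(children):
            sub = walk(j, i)
            # all but the last branch go in parentheses
            out += sub if k == len(children) - 1 else "(" + sub + ")"
        return out

    return walk(rng.randrange(len(atoms)), None)

# Ethanol as a toy graph: C-C-O
atoms, bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
variants = {random_smiles(atoms, bonds, seed=s) for s in range(50)}
# e.g. {"CCO", "OCC", "C(C)O", "C(O)C"} -- all equivalent to ethanol
```

Each distinct string is a valid alternative spelling of the same molecule, which is what lets one AqSolDB entry yield ten augmented training examples.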

7- Page 6, first paragraph: the kde4 LSTM aug model achieved 1.049, 1.054, and 1.340. There are no such values in Table 1, which was referred to.

Answer: Thank you for catching this error. The text was updated to reflect correct values as displayed in Table 1.

8- Page 6, Section 4.3: The authors cited the Lakshminarayanan study, which used ensemble sizes of 1, 5, and 10. What was the reason for choosing ensemble sizes of 4 and 10 in this study? As mentioned in the Lakshminarayanan study, the ensemble weights need to be optimized.

Answer: The main goal of this study is to show that we can make deep learning models accessible and easy to use by non-specialists. To achieve this, we needed to balance the model's accuracy against its size: increasing the ensemble size improves accuracy at the cost of more computationally expensive inference. We ran tests using 2, 4, 8, and 12 models in the ensemble and decided to further investigate the region between 8 and 12 models. We found that using 10 models gives a good balance between computational cost and accuracy for this application.
As far as we remember, Lakshminarayanan's study uses a uniformly weighted ensemble of models. In their discussion, they suggested that optimizing the ensemble weights might yield further improvement, but no systematic study was presented. In our study, we used a uniformly weighted ensemble.
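To make the uniform weighting concrete, here is a minimal sketch (with hypothetical member predictions, not values from our model) of the Gaussian moment-matching combination used for deep ensembles in Lakshminarayanan et al.:

```python
import math
import statistics

def ensemble_predict(member_outputs):
    """Combine the (mean, std) predictions of a uniformly weighted
    deep ensemble via Gaussian-mixture moment matching:
      mu  = mean_i(mu_i)
      var = mean_i(sigma_i^2 + mu_i^2) - mu^2
    """
    mu = statistics.fmean(m for m, _ in member_outputs)
    var = statistics.fmean(s * s + m * m for m, s in member_outputs) - mu * mu
    return mu, math.sqrt(var)

# Four hypothetical logS predictions (mu_i, sigma_i) from a 4-member ensemble
members = [(-2.9, 0.4), (-3.1, 0.5), (-3.0, 0.45), (-3.0, 0.5)]
mu, sigma = ensemble_predict(members)
```

The reported uncertainty thus reflects both the members' own predicted variances and their disagreement about the mean.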

9- The authors need to state the accuracy of the model by itself. The results in Tables 1 and 2 count as the validation part of the model.

Answer: One point comprehensively shown by Sorkun et al. (2021) and Francoeur et al. (2021) is that the dataset used to train a model greatly impacts its accuracy. Francoeur et al. showed that, validating on the ESOL dataset, the same model achieved an RMSE of 0.278 when trained on the ESOL training set but an RMSE of 2.99 when trained on AqSolDB. For that reason, we argue that comparing models by their own validation metrics is not fair. We therefore used Llinas' datasets as our validation set and compared our performance against other studies that did the same.

10- The largest outliers should be analyzed.

Answer: We focused this study on developing and deploying the model to the website. We assessed the model's accuracy to show that it achieves SOTA-comparable performance. We leave further analysis of the model's caveats, explanations, interpretations, and improvements for future work.

11- The authors should use a consistent number of significant figures (e.g., 2 decimal places) when reporting r², MAE, RMSE, etc.

Answer: Every metric from our experiments is reported to three significant figures. However, some studies from the literature report only two digits.
Because we do not want to reproduce metrics from other studies with a different number of significant figures, we kept both our values and the literature values as originally reported.

12- Table 1: The RF model was better for solubility challenges 2-1 and 2-2. Please comment on possible reasons.

Answer: Random forests are still the state of the art for small-molecule solubility prediction. Because the RF uses carefully selected descriptors, it has more physically relevant input and is intrinsically invariant. Our goal is not to outperform it; instead, we aim to show that RNNs can perform comparably to SOTA models with a much simpler input and can be used for fast inference on a device with low computing power.

13- Section 3.1: The authors said they employed SMILES strings for their model. It should be made clear which model was used to convert the strings into input values. In addition, as mentioned in the "conclusion" section, they predict logS directly from SMILES. However, it is not clear why a descriptor-based method was used in the "results" section. If this method was used to compare RF with DNN/kde, why was RF not modeled on the SMILES strings instead of descriptors?

Answer: In Section 2.2 and Figure 1, we show the workflow used to convert SMILES or SELFIES to the model input: ``Our DNN model uses Self-referencing embedded strings (SELFIES) tokens as input. SMILES or SELFIES molecule representations are converted to tokens based on a pre-defined vocabulary generated from our training data, resulting in 273 available tokens.''. We improved this paragraph by describing how the vocabulary was generated.
We mention that we predict LogS directly from SMILES because this translation from string representations to tokens is done internally.
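To illustrate the string-to-token conversion described above, the following is a minimal sketch of vocabulary generation and tokenization. It is not the authors' actual implementation, and the training strings, vocabulary, and indices shown are hypothetical; the real model uses a 273-token vocabulary built from the full training set.

```python
import re

def tokenize_selfies(selfies_str):
    """Split a SELFIES string into its bracketed tokens, e.g. '[C][O]' -> ['[C]', '[O]']."""
    return re.findall(r"\[[^\]]*\]", selfies_str)

def build_vocab(training_selfies):
    """Map each unique token seen in the training data to an integer index."""
    tokens = sorted({t for s in training_selfies for t in tokenize_selfies(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def encode(selfies_str, vocab):
    """Convert a SELFIES string into the integer sequence consumed by the model."""
    return [vocab[t] for t in tokenize_selfies(selfies_str)]

# Hypothetical two-molecule training set (ethanol and formaldehyde as SELFIES).
training = ["[C][C][O]", "[C][=O]"]
vocab = build_vocab(training)      # {'[=O]': 0, '[C]': 1, '[O]': 2}
print(encode("[C][C][O]", vocab))  # [1, 1, 2]
```

In practice, converting a SMILES string to SELFIES first (e.g. with the `selfies` Python package) makes this tokenization step trivial, since every SELFIES token is delimited by brackets.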


14- The authors mentioned that each entry of AqSolDB was used to generate ten new unique SMILES strings. These ten SMILES per entry cannot be seen in the uploaded database. The database with 96,625 molecules needs to be shared, even as an additional file, or it should be updated on GitHub.

Answer: This issue was addressed in item 6.

# Referee 3 #

## Comments to the Author ##
The manuscript presents a novel approach to predicting small molecule solubility using deep ensemble neural networks, focusing on the use of in-browser machine learning models. This study is significant due to its potential impact on the accessibility and usability of machine learning models in chemistry. The authors have developed a deep ensemble model that quantifies uncertainty, enhancing the reliability of predictions. The approach of running the model on a static website is innovative, potentially reducing implementation costs and improving accessibility. I recommend it for publication after addressing the following minor comments.

Answer: We thank the referee for the positive review. The minor points raised are carefully addressed below.


## Minor issues ##

1- The field of molecular solubility prediction is highly active, with numerous relevant studies existing. Enhancing the paper with a more comprehensive comparison to these existing models and a deeper exploration of the limitations of the proposed model would be advantageous. It would be helpful for readers if a table summarizing these comparisons and analyses were included in the Supplementary Information.

Answer: One of our arguments, in accordance with Sorkun et al. (2021) and Francoeur et al. (2021), is that the dataset used to train a model greatly impacts its validation performance. As shown by Sorkun et al. (2021), using different subsets of AqSolDB led to models with very different validation RMSEs. To address this concern, we used the Llinas et al. datasets as validation/benchmark to make the comparison fair. Therefore, only studies that tested their model on the same datasets were included in Table 2.
However, we agree with the referee that the paper could benefit from a more comprehensive explanation of our model's limitations and how it compares with other models available in the literature. We reformulated our introduction to discuss how our model differs from other feature-based models and transformers.

2- Similarly, the superiority of the proposed method requires a more in-depth comparison. This could be achieved by more thoroughly analyzing and contrasting the method with those outlined in previous studies.

Answer: Similarly to the previous item, the introduction was reformulated to include a deeper discussion of the gaps left by other models and how our model fills them. We also added a deeper discussion comparing our model against transformers and the newest models to the "Discussion" section.




Round 2

Revised manuscript submitted on 23 Feb 2024
 

07-Mar-2024

Dear Dr White:

Manuscript ID: DD-ART-11-2023-000217.R1
TITLE: Predicting small molecules solubility on endpoint devices using deep ensemble neural networks

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

The authors have effectively addressed the concerns raised in the initial manuscript. Additionally, their discussion on the trade-off between the model's performance and its complexity is commendable. The effort to create a user-friendly machine learning platform, despite the inherent challenges compared to developing models in a pure code format, is particularly noteworthy. Such a platform could significantly benefit researchers lacking a background in coding. Given these considerations, I recommend the acceptance of this paper.

Reviewer 3

The authors addressed the comments properly; hence I recommend it for publication.

Reviewer 2

I have reviewed the revised manuscript and the authors’ responses to my comments. The manuscript, in my opinion, is quite ready for publication and I have no other comments on either my concerns or the other reviewers’ criticism.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.