From the journal Digital Discovery Peer review history

Database for liquid phase diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction

Round 1

Manuscript submitted on 05 Jul 2022
 

06-Sep-2022

Dear Dr Jirasek:

Manuscript ID: DD-ART-07-2022-000073
TITLE: Data base for diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The paper <i>Data base for diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction</i> develops three important contributions.

First, they construct a comprehensive database of diffusion coefficients, mainly taken from the Dortmund Data Bank. In the submission letter, the authors reflect on the relative scarcity of diffusion data. I agree with this view and know too well the 'fun' of trying to put together a set of diffusion coefficients to test models with. This is a useful resource - albeit with one caveat.

Has there been any attempt to review any likely systemic issues in the comprehensive database? A common issue with diffusion methods is convection, where experimental diffusion coefficients are (much) higher than expected. Figure 8 implies a number of samples where convection might be happening. On the other hand, aggregation would lead to lower diffusion coefficients than expected.

In the Analytical Chemistry paper already cited (DOI:10.1021/acs.analchem.7b05032), a list of criteria were developed to check the likely suitability of diffusion coefficients (Scope, Systematic Miscalibration, Inconsistent Diffusion Coefficients, Evidence of Convection, Evidence of Aggregation). While a filter has been applied to the whole set of DDB diffusion coefficients, has a similar approach to the Analytical Chemistry paper been taken for the diffusion coefficients used? Or do the authors think it is not required?

To go further, the testing database contains compounds and molecules that I would expect SEGWE to really struggle with. Figure 7 shows a grid of residuals of the SEGWE model and indicates CO<sub>2</sub> as having the largest deviations. I would not expect CO<sub>2</sub> to behave at all similarly to the organic compounds used to generate SEGWE in its original form (2013, DOI:10.1002/anie.201207403) - I would imagine it to be denser and therefore move faster. Likewise, water's tendency to hydrogen bond would help explain its large deviations, in the other direction to those of carbon dioxide. Delving into the database, I find solutes #124-127 as inorganic solids (chlorides, thiocyanates) and #144-146 another set of inorganic solids. Again, I would not necessarily expect SEGWE to handle these compounds at all well. Some discussion of the database spanning a wider range of chemical space than the models were intended for would be useful.

Second, they use the comprehensive data base to assess a number of different models. I think this is very useful and a good addition to the science.

While the Wilke and Chang approach is derived from the Stokes-Einstein equation, the use of a 1/2 power for molecular weight dependence (rather than 1/3) and different power relations for molecular weight and molecular volume leave it looking like a cousin of power-law based models such as those proposed by Crutchfield and Harris (2007, DOI:10.1016/j.jmr.2006.12.004), Williard (2009 review, DOI:10.1021/ar800127e) and Stalke (2015, DOI:10.1039/C5SC00670H). I think some mention of these alternative methods is needed. A recent Progress in NMR Spectroscopy review (10.1016/j.pnmrs.2019.11.002) compares and contrasts these various methods, from Stokes-Einstein, past Wilke and Chang, to power laws and SEGWE.

Third, and this is the area in which I am less able to assess, they develop a data-driven method for predicting diffusion coefficients based on both SEGWE but also matrix-completion methods. As far as I can see, this is a solid advance and also leads to better prediction of diffusion coefficients. Is there any way this can be made more accessible to a wide range of users? A simple GUI where you select solvent, select solute and it generates a predicted diffusion coefficient would make the whole work much more readily available for use. Both SEGWE and Stalke, as well as a recent Analytical Chemistry paper on the diffusion of proteins (DOI:10.1021/acs.analchem.8b05617), have such tools. These make use of the new models very easy indeed.

A final, more general, point. The start of the paper is a bit haphazardly structured. It jumps in the first page from an introductory paragraph, to something akin to a method, before going back to a more detailed introduction. I ended up jumping back and forth at the very start of the paper before really getting going with my review.

Please do not take the length of the review as a negative. I recommend a minor revision: First, based on discussion of experimental diffusion coefficients and systemic errors within which will help others use the comprehensive database; second, based on the discussion of alternative models for predicting diffusion coefficients, and finally, third, based on thinking about tools to make the new predictions highly accessible to a wide range of users.

Reviewer 2

1. The manuscript mainly discusses the liquid phase diffusion coefficient, and it is recommended that the title be changed to Data base for liquid phase diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction.
2. Are there other methods in the literature on diffusion coefficient prediction? For example, is the QSPR method feasible?

Reviewer 3

This seems like good work with potentially high impact as a reference database of computed values, but the quality of data presentation wrt machine-actionability is poor. Some examples:

* there is dependence on a "DIPR database", for which there is a text citation but no link, no indication of specific values/version used, etc. Thus, reproduction is jeopardized.

* Worksheet labels are incorrect wrt their designations in the ESI PDF vs the xlsx files. "ListOfCompenents" is actually e.g. "FullListOfComponents", "DataBase" is actually e.g. "FullDataBase", etc. This may seem minor, but it hinders automated alignment of descriptions with data, i.e. machine actionability, and is a sign of sloppiness that calls into question the rigor applied to other aspects of data preparation and reporting that may not be as straightforwardly evaluated.

* There are two tables in e.g. the FullListOfComponents sheet, which could be one table, e.g. with columns such as "solute name", but it is not. This hampers machine actionability. If the file format chosen were e.g. CSV, which is generally preferred to xlsx for machine actionability, then this need would be more apparent.

* wrt the Stan code, there are 4 isolated files, but no (machine-actionable) indication of how they are applied, e.g. there is no indication of input/output of xlsx files. The computational workflow for reproduction/verification of method thus appears absent.

I feel the work may be strong, but the data presentation/registration requires major revision in my estimation for this work to be appropriate for inclusion in this journal.


 

We thank the referees for their thorough assessments and the useful recommendations.

>Referee #1

>Comments to the Author
>The paper Data base for diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction develops three important contributions.

>First, they construct a comprehensive database of diffusion coefficients, mainly taken from the Dortmund Data Bank. In the submission letter, the authors reflect on the relative scarcity of diffusion data. I agree with this view and know too well the 'fun' of trying to put together a set of diffusion coefficients to test models with. This is a useful resource - albeit with one caveat.

>Has there been any attempt to review any likely systemic issues in the comprehensive database? A common issue with diffusion methods is convection, where experimental diffusion coefficients are (much) higher than expected. Figure 8 implies a number of samples where convection might be happening. On the other hand, aggregation would lead to lower diffusion coefficients than expected.

>In the Analytical Chemistry paper already cited (DOI:10.1021/acs.analchem.7b05032), a list of criteria were developed to check the likely suitability of diffusion coefficients (Scope, Systematic Miscalibration, Inconsistent Diffusion Coefficients, Evidence of Convection, Evidence of Aggregation). While a filter has been applied to the whole set of DDB diffusion coefficients, has a similar approach to the Analytical Chemistry paper been taken for the diffusion coefficients used? Or do the authors think it is not required?

The critical assessment of the data is indeed a very important point and we have in fact partly used similar criteria as in the paper the referee mentions. To give a better understanding of the data filtering and curation approach that we have used, we have added a new section to the ESI (page 1):

“S.1 Data Curation

In the following, we describe the criteria that we have applied for deciding whether to include a data point in our data base. First, all data points that were labeled in the Dortmund Data Bank (DDB) to be of poor quality were omitted. Furthermore, we have excluded all solutes and solvents without a well-defined molecular composition, such as polymers and pseudocomponents (e.g., seawater, jet fuel, bitumen). In cases where we found data points to be erroneously labeled in the DDB, e.g., when predicted data was reported as experimental data, or in cases where the reported type of diffusion coefficient was unclear, we have excluded that data as well.

Moreover, the consistency of the reported diffusion coefficients was assessed in two ways. First, for mixtures for which multiple data points at similar concentrations (differences below 0.02 mol/mol) were reported by different authors, those deviating by more than one standard deviation from the mean were excluded. Second, for mixtures for which data points were measured over a range of concentrations, we have removed those data points that deviated more than one standard deviation from the fitted curve describing the concentration dependence of Dij (cf. description of the fitting procedure in Section 2 of the manuscript).

Going beyond the formal data curation steps described above, we note that the matrix completion methods (MCMs) developed in this work can be used to obtain information on erroneous data: MCMs basically analyze data sets for (hidden) structure, which they will not be able to find in the case of erroneous data; hence, such data points are likely to be outliers in the MCM predictions. Therefore, it is interesting to analyze the outliers in the MCM predictions closer in order to find out whether the deviation might stem from errors in the data. However, this requires applying methods beyond the MCMs and was not in the scope of the present work.”
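The first consistency check described in S.1 can be illustrated with a short sketch. It is not the script used for the curation: the function name, the choice of the sample standard deviation, and the numerical values are assumptions made for illustration only.

```python
from statistics import mean, stdev


def filter_by_one_sigma(values):
    """Keep values within one standard deviation of the group mean.

    `values` holds diffusion coefficients reported by different authors for
    the same mixture at similar concentrations. Using the sample standard
    deviation (n - 1) is an assumption; the text does not specify it.
    """
    if len(values) < 3:
        # stdev needs at least two points, and two points always pass the
        # test, so small groups are returned unchanged.
        return list(values)
    m = mean(values)
    s = stdev(values)
    return [v for v in values if abs(v - m) <= s]


# Hypothetical values in 1e-9 m^2/s, not taken from the data base.
reported = [1.20, 1.22, 1.25, 1.90]
kept = filter_by_one_sigma(reported)
```

In this hypothetical group, the value 1.90 lies more than one standard deviation above the mean and is dropped, while the three mutually consistent values are kept.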

We refer to the new section of the ESI in the manuscript on page 3:

“We have restricted this study to well-defined molecular components, i.e., we have excluded mixtures that contain polymers or pseudocomponents as well as some other special cases (cf. ESI Section S.1). We will not continue to mention these restrictions in the following discussion.”

and

“We have preferred applying a standard procedure over an ad hoc consideration of each mixture, not only because of the time required for this but also to avoid ambiguity. In all cases, obvious outliers were rejected beforehand (cf. ESI Section S.1).”

>To go further, the testing database contains compounds and molecules that I would expect SEGWE to really struggle with. Figure 7 shows a grid of residuals of the SEGWE model and indicates CO2 as having the largest deviations. I would not expect CO2 to behave at all similarly to the organic compounds used to generate SEGWE in its original form (2013, DOI:10.1002/anie.201207403) - I would imagine it to be denser and therefore move faster. Likewise, water's tendency to hydrogen bond would help explain its large deviations, in the other direction to those of carbon dioxide. Delving into the database, I find solutes #124-127 as inorganic solids (chlorides, thiocyanates) and #144-146 another set of inorganic solids. Again, I would not necessarily expect SEGWE to handle these compounds at all well. Some discussion of the database spanning a wider range of chemical space than the models were intended for would be useful.

We absolutely agree with the referee that an interpretation of which systems can be particularly poorly (or well) described by the established models, as well as by the newly developed MCMs, is highly interesting. Although we consider a comprehensive discussion of this issue to be outside the scope of the present work, which is focused on establishing a data base on Dij∞ and developing new models for their prediction, we have added a new section with a discussion focusing on some particularly difficult examples to the ESI on pages 4-7:

“S.2.6 Mixtures Poorly Described by Semiempirical Models

In this section, we take a closer look at those mixtures from our data base, for which Dij∞ is only poorly described by the semiempirical models, and we try to specify those groups of solutes and solvents for which this is the case. We thereby focus on SEGWE, but also briefly touch upon the other models.

For discussing the performance of SEGWE in detail, we refer to Figure 7 in the manuscript, which shows the residuals of the SEGWE predictions from the experimental data. One solute that SEGWE is apparently struggling to describe accurately is water (solute i=27, cf. Figure 7). In our reduced data base, there are eight mixtures with the solute water; the relative deviations of the SEGWE predictions from the experimental data for Dij∞ for these eight mixtures are shown in Figure S.2.

[Figure S.2 here]
Figure S.2: Relative deviations $\delta D_{ij}^{\infty} = (D_{ij}^{\infty,\mathrm{pred}} - D_{ij}^{\infty,\mathrm{exp}}) / D_{ij}^{\infty,\mathrm{exp}}$ of the SEGWE predictions for $D_{ij}^{\infty}$ of the solute water in different solvents from the experimental data from the reduced data base.

We find the largest positive relative deviations for mixtures in which strong hydrogen bonding occurs, namely the mixtures (water + ethanol) and (water + 1-propanol). Slightly smaller, but still large positive relative deviations are found for mixtures of water with solvents in which weaker hydrogen bonds are formed (acetone, butyl acetate, N-methyl-2-pyrrolidone, and methyl isopropyl ketone, cf. Figure S.2). This is not astonishing as the developers of SEGWE have explicitly excluded data for mixtures with “aggregating components” in the development of SEGWE[11]. Aggregation leads to lower diffusion coefficients; an effect which is not described by SEGWE, which, as a consequence, overpredicts Dij∞ in such mixtures, cf. Figure S.2.

High positive relative deviations of the SEGWE predictions from the experimental data are also found for many other hydrogen bonding systems in our data base.

Furthermore, SEGWE mispredicts Dij∞ in mixtures where the molecular mass in relation to the molecule size strongly differs between both components. This is in particular the case if one of the components contains heavy atoms, and the other does not. The reason for this is that in the development of SEGWE, it was assumed that both solute and solvent can be modeled as hard spheres, and that both spheres have an equal ratio of mass to volume – the so-called effective density ϱeff of the mixture.

An instructive example for this case is the result for the solute carbon dioxide (i=39) in Figure 7 of the manuscript. Carbon dioxide has a relatively large molecular mass in relation to its molecular volume, which leads to a rather high effective density compared to, e.g., typical organic solvents. Accordingly, we find SEGWE to significantly underestimate Dij∞ for basically all mixtures with carbon dioxide from the reduced data base (cf. Figure 7 in the manuscript), and even for all mixtures with carbon dioxide from the full data base (not shown here).

Two other examples for solutes in our data base with rather high effective densities are methyl iodide (i=19), which is due to the heavy iodine atom, and the fully fluorinated hexafluorobenzene (i=30); we find that SEGWE also underestimates the diffusion in all mixtures containing these two solutes. Returning to Figure S.2 as a last example, we can likewise explain the significant underestimation of the experimental Dij∞ in the mixture (water + hexadecane) by the higher effective density of water in relation to that of hexadecane (and the absence of significant attractive forces in the mixture to counteract this effect).

Finally, we briefly touch on the limitations of the models of Wilke and Chang[4], Reddy and Doraiswamy[5], and Tyn and Calus[7]. Due to their similar nature they are all subject to similar restrictions, so that they will be discussed together here. Despite the original authors’ intention to provide general-purpose correlations that work in nonpolar and polar mixtures alike, all three models have been found to struggle significantly with hydrogen bonding mixtures (as is also the case for SEGWE). Hence, they overpredict Dij∞ for hydrogen bonding solvents, such as methanol, ethanol and 1-propanol. Further, the Wilke-Chang model is inaccurate in the prediction of the diffusion of water in organic solvents, which has been described before in the literature[12]. Accordingly, we find a significant overestimation of Dij∞ by the Wilke-Chang model for nearly all mixtures from the reduced data base in which water is the solute, with the exception of the mixture (water + hexadecane). This trend is not observed for the models of Tyn and Calus or Reddy and Doraiswamy.

Lastly, we note that MCMs can be used to identify such systematic deviations in the predictions of (semiempirical) models, and that MCMs can also predict them, which is used in the hybrid MCM based on “boosting” for improving the performance of the semiempirical models, cf. Figure 6 in the manuscript.”
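The sign convention of the relative deviation used in Figure S.2 can be made explicit with a minimal sketch. The function name and the numerical values are hypothetical and serve only to show that positive values correspond to overprediction and negative values to underprediction.

```python
def relative_deviation(d_pred, d_exp):
    """Return (d_pred - d_exp) / d_exp, as defined in the caption of Figure S.2."""
    if d_exp <= 0:
        raise ValueError("experimental diffusion coefficient must be positive")
    return (d_pred - d_exp) / d_exp


# Hypothetical values in 1e-9 m^2/s, for illustration only.
over = relative_deviation(d_pred=1.5, d_exp=1.0)   # positive: overprediction
under = relative_deviation(d_pred=0.8, d_exp=1.0)  # negative: underprediction
```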

We refer to the new section of the ESI in the manuscript on page 9:

“A more detailed discussion of the mixtures for which SEGWE gives predictions with particularly large errors is included in the ESI (cf. Section S.2.6).”

Further, we have included a new paragraph in the ESI on page 2, mentioning the general limitations of the scope of the semiempirical models:

“While the four semiempirical models have been developed as general-purpose correlations that aim at describing a diverse set of mixtures and components, there are still some restrictions in the scope of these models, which we briefly mention here. All authors have limited their models to moderate viscosities and have excluded data for viscous solvents (e.g., polymers) from their training sets. Further, none of the semiempirical models were trained on data of mixtures containing electrolytes, i.e., neither mixtures with salts as solutes nor with ionic liquids as solutes or solvents should be expected to be predicted with high accuracy.”

>Second, they use the comprehensive data base to assess a number of different models. I think this is very useful and a good addition to the science.

We again thank the referee for their kind evaluation of our work.

>While the Wilke and Chang approach is derived from the Stokes-Einstein equation, the use of a 1/2 power for molecular weight dependence (rather than 1/3) and different power relations for molecular weight and molecular volume leave it looking like a cousin of power-law based models such as those proposed by Crutchfield and Harris (2007, DOI:10.1016/j.jmr.2006.12.004), Williard (2009 review, DOI:10.1021/ar800127e) and Stalke (2015, DOI:10.1039/C5SC00670H). I think some mention of these alternative methods is needed. A recent Progress in NMR Spectroscopy review (10.1016/j.pnmrs.2019.11.002) compares and contrasts these various methods, from Stokes-Einstein, past Wilke and Chang, to power laws and SEGWE.

This is a good point and we mention this now in the introduction and refer the reader to further literature. Specifically, we have added on page 2 of the manuscript:

“A large number of further semiempirical models for the prediction of Dij∞ in binary liquid mixtures or extensions upon the previously mentioned ones exist in the literature, but most of them are either less general (in the scope of the components that can be modeled by them) or less accurate than these [12]. Power-law models, which have also been applied in the literature for modeling diffusion coefficients [13-15], suffer from a similar restriction in generality as they must be “calibrated” to a specific substance group, and they depend strongly on the type of components investigated. For a more detailed discussion of such approaches and their delimitation from the semiempirical models investigated here, we refer to the review of Evans [16].”
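The square-root dependence on molar mass that the referee points out is visible in the commonly quoted textbook form of the Wilke-Chang correlation, sketched below. The formula and its units follow the standard literature form and are not copied from the manuscript; the function name is hypothetical, and the input values are approximate, illustrative numbers for ethanol as the solute in water.

```python
def wilke_chang(temperature_k, solvent_molar_mass, solvent_viscosity_cp,
                solute_molar_volume, association_factor=1.0):
    """Diffusion coefficient at infinite dilution in cm^2/s (Wilke-Chang form).

    Units: temperature in K, solvent molar mass in g/mol, solvent viscosity
    in cP, solute molar volume at the normal boiling point in cm^3/mol.
    """
    if solvent_viscosity_cp <= 0 or solute_molar_volume <= 0:
        raise ValueError("viscosity and molar volume must be positive")
    return (7.4e-8
            * (association_factor * solvent_molar_mass) ** 0.5
            * temperature_k
            / (solvent_viscosity_cp * solute_molar_volume ** 0.6))


# Approximate, illustrative inputs: water as solvent (association factor 2.6,
# about 0.89 cP at 298 K) and a solute molar volume of about 59.2 cm^3/mol.
d_example = wilke_chang(298.15, 18.02, 0.89, 59.2, association_factor=2.6)
```

With these inputs the estimate is of the order of 1e-5 cm^2/s. Doubling the solvent molar mass scales the result by the square root of two, which is the 1/2 power discussed above.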

>Third, and this is the area in which I am less able to assess, they develop a data-driven method for predicting diffusion coefficients based on both SEGWE but also matrix-completion methods. As far as I can see, this is a solid advance and also leads to better prediction of diffusion coefficients. Is there any way this can be made more accessible to a wide range of users? A simple GUI where you select solvent, select solute and it generates a predicted diffusion coefficient would make the whole work much more readily available for use. Both SEGWE and Stalke, as well as a recent Analytical Chemistry paper on the diffusion of proteins (DOI:10.1021/acs.analchem.8b05617), have such tools. These make use of the new models very easy indeed.

We thank the referee for this suggestion and have now included a user-friendly GUI (CalculationTool.xlsx), where any solute-solvent combination can be selected by name (or DDB No.) via drop-down boxes and the respective predicted values of both hybrid MCMs as well as their model uncertainties in the form of standard deviations are given as output.

On this occasion, we also note that we have already included all predictions of the developed hybrid MCMs as a machine-readable spreadsheet in the ESI, where the values for all 10,608 possible solute-solvent combinations are tabulated using the DDB identification numbers.

>A final, more general, point. The start of the paper is a bit haphazardly structured. It jumps in the first page from an introductory paragraph, to something akin to a method, before going back to a more detailed introduction. I ended up jumping back and forth at the very start of the paper before really getting going with my review.

We thank the referee for this remark and agree: the structure of the introduction was in fact not optimal. We have, therefore, restructured the introduction (please see the changes highlighted in the marked version of the manuscript) and hope that the introduction is now easier to read.

>Please do not take the length of the review as a negative.

Not at all! Thank you for the thorough review and many valuable suggestions!

>I recommend a minor revision: First, based on discussion of experimental diffusion coefficients and systemic errors within which will help others use the comprehensive database; second, based on the discussion of alternative models for predicting diffusion coefficients, and finally, third, based on thinking about tools to make the new predictions highly accessible to a wide range of users.


>Referee #2

>Comments to the Author

>1. The manuscript mainly discusses the liquid phase diffusion coefficient, and it is recommended that the title be changed to Data base for liquid phase diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction.

This is a good suggestion, which we have gladly adopted. The title of the manuscript was changed accordingly.

>2. Are there other methods in the literature on diffusion coefficient prediction? For example, is the QSPR method feasible?

We have deliberately restricted our comparison of the newly developed methods to the selected four semiempirical models, since they are the most established general-purpose models in this field. There are, of course, many more models in the literature for predicting diffusion coefficients, including QSPR approaches, which are, however, all restricted to a small number of components or classes of components. While it was not our intention to provide a comprehensive review of all available prediction methods for diffusion coefficients, we have now included a short discussion on alternative models on page 2 of our revised manuscript (cf. also our answer to one of the questions of the first referee):

“A large number of further semiempirical models for the prediction of Dij∞ in binary liquid mixtures or extensions upon the previously mentioned ones exist in the literature, but most of them are either less general (in the scope of the components that can be modeled by them) or less accurate than these [12]. Power-law models, which have also been applied in the literature for modeling diffusion coefficients [13-15], suffer from a similar restriction in generality as they must be “calibrated” to a specific substance group, and they depend strongly on the type of components investigated. For a more detailed discussion of such approaches and their delimitation from the semiempirical models investigated here, we refer to the review of Evans [16].”

Regarding QSPR approaches, we have added some text discussing such methods on page 2 of the manuscript:

“Descriptor-based methods of the QSPR type can also be used for predicting mixture properties, and of course also for the prediction of diffusion coefficients. In particular, artificial neural networks (ANNs) have been used successfully in QSPR approaches by several authors [22-25]; however, these studies were often restricted to specific mixtures, such as diffusion in water [24,25] or diffusion in hydrocarbon mixtures [22]. General-purpose models for the prediction of diffusion coefficients at infinite dilution based on ML methods are still missing to date.

An interesting class of unsupervised ML algorithms for the prediction of thermophysical properties of mixtures in general, and of Dij∞ in particular, are matrix completion methods (MCMs), which are already established in recommender systems, e.g., for providing suitable movie recommendations to customers of streaming providers [26,27].”


>Referee #3

>Comments to the Author
>This seems like good work with potentially high impact as a reference database of computed values, but the quality of data presentation wrt machine-actionability is poor.

>Some examples:

>* there is dependence on a "DIPR database", for which there is a text citation but no link, no indication of specific values/version used, etc. Thus, reproduction is jeopardized.

We have amended the citation in question by specifying the version of the DIPPR data base that we have used and by providing a URL to the institute’s homepage:

“R. L. Rowley, W. V. Wilding, J. L. Oscarson, Y. Yang, N. A. Zundel, T. E. Daubert and R. P. Danner, DIPPR Data Compilation of Pure Chemical Properties, Design Institute for Physical Properties, AIChE, 2003, https://www.aiche.org/dippr, Data base date: 2018, retrieved via The DIPPR Information and Data Evaluation Manager for the Design Institute for Physical Properties - Version 12.3.0 (May 2018 Public).”

>* Worksheet labels are incorrect wrt their designations in the ESI PDF vs the xlsx files. "ListOfCompenents" is actually e.g. "FullListOfComponents", "DataBase" is actually e.g. "FullDataBase", etc. This may seem minor, but it hinders automated alignment of descriptions with data, i.e. machine actionability, and is a sign of sloppiness that calls into question the rigor applied to other aspects of data preparation and reporting that may not be as straightforwardly evaluated.

The label descriptions used in the ESI were intended to serve as a lumped reference for both provided data bases. However, we understand the confusion that may arise from this naming and we have now addressed this issue by providing separate files for each data base and sorting them into two folders as described in detail in our answer to the referee’s next comment.

>* There are two tables in e.g. the FullListOfComponents sheet, which could be one table, e.g. with columns such as "solute name", but it is not. This hampers machine actionability. If the file format chosen were e.g. CSV, which is generally preferred to xlsx for machine actionability, then this need would be more apparent.

We agree with the referee that .csv files should be preferred over .xlsx files and regret the limited machine actionability of our initial submission. To address these issues, we have restructured the tabular ESI and switched from .xlsx files to the open .csv format. The data have now been organized into two folders, “full” and “reduced” (representing the full data base and the reduced data base that we provide). Each folder contains the following files:

- List_Solutes.csv
- List_Solvents.csv
- DataBase.csv
- SEGWE.csv
- Boosting_Predictions.csv
- Boosting_LV_Solutes.csv
- Boosting_LV_Solvents.csv
- Whisky_Predictions.csv
- Whisky_LV_Solutes.csv
- Whisky_LV_Solvents.csv

The file names are identical in the “full” and “reduced” folders; the files differ only in the underlying data base (full data base matrix: 208x51 components, reduced data base matrix: 45x23 components).

While the total number of files is now larger, we believe the new structure and naming scheme are self-explanatory and directly improve the machine accessibility of the ESI.

Accordingly, we have adjusted the description of these files in the ESI on pages 8-10.
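As an illustration of how the folder layout described above could be checked automatically, the following minimal Python sketch enumerates the ten file names in both folders and reports any that are absent. The folder and file names are taken from the list above; the root directory passed to the functions and the function names themselves are assumptions made for this sketch and are not part of the ESI.

```python
from pathlib import Path

# Folder and file names as listed in the response above.
FOLDERS = ("full", "reduced")
FILE_NAMES = (
    "List_Solutes.csv",
    "List_Solvents.csv",
    "DataBase.csv",
    "SEGWE.csv",
    "Boosting_Predictions.csv",
    "Boosting_LV_Solutes.csv",
    "Boosting_LV_Solvents.csv",
    "Whisky_Predictions.csv",
    "Whisky_LV_Solutes.csv",
    "Whisky_LV_Solvents.csv",
)


def expected_files(root):
    """Return every expected path below `root` (hypothetical ESI root folder)."""
    root = Path(root)
    return [root / folder / name for folder in FOLDERS for name in FILE_NAMES]


def missing_files(root):
    """Return the expected paths that do not exist as files."""
    return [path for path in expected_files(root) if not path.is_file()]
```

Calling `missing_files` on the unpacked ESI directory would return an empty list if the layout matches the description.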

>* wrt the Stan code, there are 4 isolated files, but no (machine-actionable) indication of how they are applied, e.g. there is no indication of input/output of xlsx files. The computational workflow for reproduction/verification of method thus appears absent.

In addition to the previously attached Stan files, we have now included a wrapper code for each MCM, i.e., a MATLAB script that reads the training data from a .csv file, applies the developed MCMs for the prediction of the full matrix, and exports the result to a .csv file. We have made note of this on page 21 of the ESI:

“Furthermore, we have included a wrapper code for each MCM, i.e., a MATLAB script that reads the training data from a .csv file, applies the developed MCMs for the prediction of the full matrix, and exports the result to a .csv file.”
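The read, predict, export pattern of such a wrapper can be sketched in a few lines of Python. This is not the MATLAB/Stan code supplied with the ESI: the CSV layout (solvent names in the header row, solute names in the first column, empty cells for unmeasured pairs) is an assumption, all function names are hypothetical, and the predictor is a deliberately trivial additive stand-in (row mean plus column mean minus overall mean) rather than a matrix completion method. Unlike the wrapper described above, it only fills the empty cells and leaves observed entries unchanged.

```python
import csv


def read_matrix(path):
    """Read a solute x solvent matrix; empty cells become None (assumed layout)."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    solvents = rows[0][1:]
    solutes = [row[0] for row in rows[1:]]
    values = [[float(c) if c.strip() else None for c in row[1:]] for row in rows[1:]]
    return solutes, solvents, values


def _mean(xs, default):
    """Mean of the non-missing entries, or `default` if there are none."""
    xs = [x for x in xs if x is not None]
    return sum(xs) / len(xs) if xs else default


def fill_missing(values):
    """Placeholder predictor: row mean + column mean - overall mean."""
    overall = _mean([v for row in values for v in row], 0.0)
    row_means = [_mean(row, overall) for row in values]
    col_means = [_mean(col, overall) for col in zip(*values)]
    return [
        [v if v is not None else row_means[i] + col_means[j] - overall
         for j, v in enumerate(row)]
        for i, row in enumerate(values)
    ]


def write_matrix(path, solutes, solvents, values):
    """Write the completed matrix in the same layout as the input."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["solute"] + solvents)
        for name, row in zip(solutes, values):
            writer.writerow([name] + row)


def run_workflow(in_path, out_path):
    """Read training data, fill the matrix, and export the result."""
    solutes, solvents, values = read_matrix(in_path)
    write_matrix(out_path, solutes, solvents, fill_missing(values))
```

Keeping input and output in the same CSV layout means the exported file can be read back with `read_matrix`, which makes a round-trip check of the workflow straightforward.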

We absolutely agree with the referee that method reproduction and verification are of high importance. However, we have to note here again (as we also openly describe in the manuscript) that we are not able to provide the full training sets for the newly developed models, because parts of the training set are proprietary data taken from the Dortmund Data Bank. We also view the current data situation in our field very critically and believe that open accessibility of (any kind of) data should be the default, not the exception.

While we cannot simply ignore the current policy of the Dortmund Data Bank, we emphasize that we directly contribute to the open availability of data with the present work by including all predictions for all possible combinations of all considered solutes and solvents in the ESI.

>I feel the work may be strong, but the data presentation/registration requires major revision in my estimation for this work to be appropriate for inclusion in this journal.

We again thank the referee for their careful assessment of our work and the valuable suggestions.




Round 2

Revised manuscript submitted on 28 Sep 2022
 

06-Oct-2022

Dear Dr Jirasek:

Manuscript ID: DD-ART-07-2022-000073.R1
TITLE: Data base for liquid phase diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 2

The paper summarizes a data base for liquid phase diffusion coefficients at infinite dilution at 298 K and matrix completion methods for their prediction, and is similar to a reference manual.

Reviewer 1

A thorough response, covering every point in an excellent level of detail. Thank you.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.