From the journal Digital Discovery Peer review history

You do not have JavaScript enabled. Please enable JavaScript to access the full features of the site or access our non-JavaScript page.

Round 1

Manuscript submitted on 17 May 2022

Editor’s decision letter

13-Jun-2022

Dear Dr Wilmer:

Manuscript ID: DD-ART-05-2022-000044
TITLE: MOFUN: a Python package for molecular find and replace

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************
EDITOR'S COMMENT:

Regarding Referee 2, query 6—At this stage in the journal's life, we are open to papers that describe software as an enabling technology for chemical discovery, without necessarily advancing a specific scientific discovery. Therefore, while I encourage you to address this point in the revised manuscript—for example, explaining research tasks that are enabled or advanced by your software—it is not strictly necessary for you to demonstrate such a use case for acceptance of your paper.

************

Reviewer comments

Reviewer 1

In their submitted work, Boone and Wilmer have created a Python package – MOFUN – for “find-and-replace” queries on both molecular and crystalline structures. While there exist a few tools with similar features (in particular, PoreMatMod.jl for crystalline materials and to a lesser extent, SMILES-based find-and-replace queries for organic molecules), there is still a clear need for the functionalities provided by MOFUN in terms of its approach, its Python interface, and its ability to handle complex molecules and materials with optional periodicity.

As the code-reviewer for this manuscript, my primary focus is on the code itself. I am happy to report that the MOFUN code functions precisely as advertised and is easy to understand. The documentation is both clear and the right level of detail for beginners to become familiar with the package. I am also very happy to see that unit tests are provided with MOFUN and are incorporated as GitHub actions. I tried the provided examples in the documentation and found them illustrative and fully functional. I also tried a few "challenging" queries of my own without major issues to report.
Overall, based on the code itself, I recommend that this work be published in Digital Discovery after a few minor changes.

1. The manuscript states how MOFUN is unique in that it uses distances between atoms, in contrast with a graph-based approach like that in PoreMatMod.jl. As stated in the text, one of the advantages of this is the ability to find-and-replace physisorbed molecules. A related example mentioned in the abstract is the use of MOFUN to add or remove solvent molecules. It would be beneficial to add a minimal example involving a "find-and-replace" procedure for physisorbed adsorbates or solvent molecules to the documentation (and possibly main text if the authors wish).

2. In the Examples section of the documentation, it would also be helpful to highlight at least one example involving modifications of metals or substructures containing metals (and, if done, would likely be worth at least briefly mentioning in the text). The current examples are focused on functionalization of organic molecules and linker-based defects. Highlighting how MOFUN can be used for modifications involving metals would further demonstrate the powerful and general approach of the code. For instance, one could imagine an example where MOFUN is used to exchange a fraction of Zr sites in UiO-66 with Hf or Ce like in DOI: 10.1021/acsmaterialslett.0c00042. Alternatively, one could imagine a MOF like MIL-100 where a Cr–Cl moiety might be “find-and-replaced” with just a V atom, for instance, or vice versa.

3. In the atoms.py __init__ function (and several other places), there are variables instantiated with lists. This is a dangerous practice in Python that can lead to unexpected behavior, especially once others start contributing to the code. All instances where a default argument value is [] or {} should be replaced by `None`, which can then be checked in the code and replaced with the desired default list or dict. See “Default parameter values are evaluated…” in the Python docs here for context: https://docs.python.org/3/reference/compound_stmts.html#function-definitions.

4. In uff4mof.py, the “Pt4+2” key is present twice with two different values such that only the latter is actually used. I think everything is fine here because the latter is from the UFF4MOF paper, but it might be worth simply commenting out the first instance so that a user doesn't get confused when quickly glancing at the code (even though this part of the code was adopted from a different codebase).

5. Does mofun.atoms.Atoms.update_fields() do anything? It looks like the labels are simply iterated through, and `pass` is the only action applied at each iteration of the loop.

6. As suggested in the guideline for authors, it would be worthwhile (although not required) to consider archiving the code in a repository that issues a DOI in case something should ever happen to the currently cited URL. The easiest way to do this would be to link the current GitHub repo with a Zenodo account, as noted here: https://guides.github.com/activities/citable-code. If a DOI is created for the MOFUN package, it should be added to the GitHub repo and the main text.

Reviewer 2

Review comments:
Dear Authors,
I read your manuscript with great interest. I think MOFUN can be a useful tool for the chemistry community. However, I have some concerns regarding the presentations, use cases and applicability of MOFUN which I believe have not been appropriately addressed. Currently the manuscript reads more like a well-written user manual and needs more work to make it more useful for a wide range of scientific community members.
I am requesting a major revision of the manuscript so that the authors can address my concerns.

1. The authors have added a mention of FunctionalizeThis and PoreMatMod.jl.Below I am listing some publications and resources that should be referenced as well:

 Ring-replacement recommender: https://peter-ertl.com/molecular/substituents/ringreplacement.html#
 J. G. Sobez and M. Reiher, Molassembler: Molecular graph construction, modification, and conformer generation for inorganic and organic molecules, J. Chem. Inf. Model., 2020, 60, 3884–3900
 E. I. Ioannidis, T. Z. H. Gani and H. J. Kulik, molSimplify: A toolkit for automating discovery in inorganic chemistry, J. Comput. Chem., 2016, 37(22), 2106–2117
 V. M. Ingman, A. J. Schaefer, L. R. Andreola and S. E. Wheeler, Qchasm: Quantum chemistry automation and structure manipulation, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2020, 11(4), e1510.
 L. Turcani, A. Tarzia, F. T. Szczypiński and K. E. Jelfs, stk: An extendable python framework for automated molecular and supramolecular structure assembly and discovery, J. Chem. Phys., 2021, 154(21), 214102.
 A. V. Kalikadien, E. A. Pidko and V. Sinha, ChemSpaX: exploration of chemical space by automated functionalization of molecular scaffold, Digital Discovery, 2022, 1, 8-25.

2. The diversity of examples is extremely limited which is in contrast with what the authors say in the first line of this manuscript in the Introduction section
Quoting
“ MOFUN is a general purpose, open-source Python package for searching an arbitrary
molecular structure for a pattern and replacing …. ”
Please demonstrate use cases on other systems as well for example: molecular cages, transition metal complexes, biomolecular systems, periodic systems involving adsorbates on metal surfaces etc. If there are systems where MOFUN cannot be used, they should be mentioned here as well.

3. In example 1 the authors created the linker file by copying it manually using VESTA software. Can linker files be automatically created, or can the user deploy a pre-made library?

4. A related question to point 3: let us say I want to find a pattern of type “C-XYZ”. You can think of CH3 unit, X=Y=Z=H, but it can be different as well, for example for a chiral molecule. Does my search pattern need to have the same bond C-X, C-Y, C-Z bond length and angles e.g. X-C-Y as the one actually present in the molecule? In the example of matching “CH3” in an octane molecule, what happens if I supplied a CH3 search pattern with H-C-H angle of 90 degrees, and C-H bond length of 1.5 Angstrom? Would it still match? If not, then I assume that the user needs to always copy the pattern to be matched from the larger molecule which is being searched itself. These points should be addressed and demonstrated with appropriate examples.

5. I feel that the manuscript lack any chemical insight that is derivable from the application of MOFUN. The authors should add some model use cases for example variation in properties such as a gas uptake capacity upon fund and replace.
6. Related to the point made above, the authors need to make a connect to the “Discovery” aspect of their “Digital” solution. I think the tool MOFUN is useful but to me it seems to be a minor automation of some tasks specific to MOFs, although I think that it can be more than that. It would be highly interesting and relevant if the authors can demonstrate an example where MOFUN can be integrated in a MOF screening workflow, or present some data-driven (e.g. machine learning on a database) examples enabled by MOFUN.

Reviewer 3

Paul Boone and Chris Wilmer report their open-source Python package, MOFUN, for find-and-replace on MOF structures. The use case of the software is to: given an atomistic MOF model, (1) search for and replace particular substructures with different substructures, (2) add defects by deleting linker(s) or metal cluster(s), and (3) atom-typing for using force fields in molecular simulations.
Though PoreMatMod.jl is similar software, MOFUN and PoreMatMod.jl have distinct use cases as summarized in the introduction. And MOFUN is faster, at least using the UiO-66 search-and-replace task for introducing defects, as a benchmark. Plus, it handles more input files and is written in a more widely used language (Python).
I think the report on/description of the software should be published in Digital Discovery. I provide a few suggestions to help the authors improve their paper, based mostly on adding clarity.

# The major-ish points
- Seems the first condition for a match should be element match since that is less compute time to check, compared to the distances...
- Selecting only three points on the search pattern for the alignment makes the procedure, possibly, sensitive to which points you choose. Why not use them all, with a least squares approach? Especially if you choose the three points to be near each other, the result may be sensitive to small perturbations. More computationally efficient though to choose three points. That you picked the two points furthest from each other is great, since this will prevent the sensitivity to small changes in position I allude to. For the example in Fig. 1, "directional axis is the axis from C to H" contradicts this, though, doesn't it? Because the furthest atoms would be H to H that 'span' the fragment.
- A figure/diagram to illustrate your search procedure would be really helpful for understanding the implementation and how it is so fast. It seems to be an instance of depth-first search on a graph. Where each vertex of the graph gives a pair of atoms that are proposed to match and a path down the graph gives a match after you get to the end. You walk down the search tree and reject at any vertex that tells you it is not a viable match, then go back a step, up the graph. (Depth-first search)
- The replacement pattern must lie in the same coordinate system as the search pattern. Is this a limiting disadvantage in some way? I am imagining the replacement pattern being relaxed by DFT, and say the rings are different and the coordinates shifted a bit. Will your code still be able to make a replacement where DFT has relaxed the replacement pattern?
- Tests: is the software well-tested, automatically, instead of just "looking" at the structures? I think adding some automatic unit tests of the functions and having them run on Github actions will make the software more trustworthy and help with future development by catching bugs when they are introduced.

# Adding clarity
- "structure"= a set of atoms" is not precise. the structures seems to include their positions as well.
- Similarly, "search pattern" is not clearly defined.
- "match" is not well-/precisely-defined either. I suspect this is a one-to-one list of atom correspondences between the structure and the search pattern (a mapping), such that the atom types match and so do the interatomic distances, up to a threshold.
- When you calculate distances between each atom in the structure/pattern, you mean "periodic distance", since you use the nearest image convention? Otherwise, not all matches would be found. Later, I see how you replicate the structure, but I was wondering this.
- The r_m = (r_s N) notation needs clear definition. Is this standard? I know combinations, but here the top is a set of vectors and the bottom is an integer (but written as bold for some reason). I've never seen a set or list of vectors in this notation, so it may need explaining. I suspect this means, "all possible one-to-one mappings between atoms of the structure to atoms of the pattern", but I don't think this notation is clear. Or is it an ordered list?
- Based on your equation, there is a well-defined ordering of the coordinates, so this is not a set anymore, but rather a list, and the "match" is a one-to-one map from atoms on the pattern to atoms on the structure.
- The third condition, that a rotation/translation can be applied. To me, this intuitively seemed to be guaranteed by the first condition of interatomic distances matching. At least, for molecules that have no chirality. Do you agree with this? Are there any instances where the interatomic distances match, but a rotation/translation cannot be found? It seems that the way you have it set up, one cannot distinguish between a match and an alignment failure (depends on how robust your alignment algorithm is).
- How do you determine the optimal translation and rotation? An optimization routine of some sort, or rather is there an analytical formula you derived for this? Please precisely clarify. Rotation matrix or quaternions?
- Fig. 1c, shouldn't that have a " not OK" in the top right, too? I am confused because you show the overlay of the pattern in panel D but not panel C? In panel C shouldn't I see an H atom pointing at me, erroneously? I'm trying to align the search pattern in A to the structure to imagine what it would look like if the orange and black axes were aligned. Maybe I misunderstand.
- Randomly picking of one of the good matches: Shouldn't you pick which one has the lowest alignment error? I suppose this is fine too, as you define "good" beforehand so say there is no difference.
- Nice to see the scaling of time with size of system.
- What is the difference between "overridden" vs "deleted and replaced"?
- Seems the force field atom typing could be useful for assigning charges too, if you wished say for aromatic and aliphatic carbons to have different charges. Can the code handle partial charges too?

# Suggestions for code
- I suggest Jupyter Notebooks with examples in the Github repository that can be easily fired up and which can easily read in the files from the relative directory. Not fun to collect files for an example, when you could launch a Jupyter Notebook from a directory with the needed files!
- Possible to use different thresholds on matching criteria for different atom-atom distances?
- When replace_fraction < 1.0, replacements are chosen at random. Is it possible to specify exactly which instances to replace?
- Can the code simply return the list of all possible matches? This might be useful.

# Praise
- Great that you registered with pip for easy installation.
- Nice piece of software!

Author response

************
EDITOR'S COMMENT:

Regarding Referee 2, query 6—At this stage in the journal's life, we are open to papers that describe software as an enabling technology for chemical discovery, without necessarily advancing a specific scientific discovery. Therefore, while I encourage you to address this point in the revised manuscript—for example, explaining research tasks that are enabled or advanced by your software—it is not strictly necessary for you to demonstrate such a use case for acceptance of your paper.

************
REVIEWER REPORT(S):
Referee: 1

Comments to the Author
In their submitted work, Boone and Wilmer have created a Python package – MOFUN – for “find-and-replace” queries on both molecular and crystalline structures. While there exist a few tools with similar features (in particular, PoreMatMod.jl for crystalline materials and to a lesser extent, SMILES-based find-and-replace queries for organic molecules), there is still a clear need for the functionalities provided by MOFUN in terms of its approach, its Python interface, and its ability to handle complex molecules and materials with optional periodicity.

As the code-reviewer for this manuscript, my primary focus is on the code itself. I am happy to report that the MOFUN code functions precisely as advertised and is easy to understand. The documentation is both clear and the right level of detail for beginners to become familiar with the package. I am also very happy to see that unit tests are provided with MOFUN and are incorporated as GitHub actions. I tried the provided examples in the documentation and found them illustrative and fully functional. I also tried a few "challenging" queries of my own without major issues to report.

Overall, based on the code itself, I recommend that this work be published in Digital Discovery after a few minor changes.

Thank you for recommending publication!

1. The manuscript states how MOFUN is unique in that it uses distances between atoms, in contrast with a graph-based approach like that in PoreMatMod.jl. As stated in the text, one of the advantages of this is the ability to find-and-replace physisorbed molecules. A related example mentioned in the abstract is the use of MOFUN to add or remove solvent molecules. It would be beneficial to add a minimal example involving a "find-and-replace" procedure for physisorbed adsorbates or solvent molecules to the documentation (and possibly main text if the authors wish).

See answer to (2).
2. In the Examples section of the documentation, it would also be helpful to highlight at least one example involving modifications of metals or substructures containing metals (and, if done, would likely be worth at least briefly mentioning in the text). The current examples are focused on functionalization of organic molecules and linker-based defects. Highlighting how MOFUN can be used for modifications involving metals would further demonstrate the powerful and general approach of the code. For instance, one could imagine an example where MOFUN is used to exchange a fraction of Zr sites in UiO-66 with Hf or Ce like in DOI: 10.1021/acsmaterialslett.0c00042. Alternatively, one could imagine a MOF like MIL-100 where a Cr–Cl moiety might be “find-and-replaced” with just a V atom, for instance, or vice versa.

We specifically chose example applications from ongoing research in our lab to guarantee that the examples fully address any complications that arise when applying MOFUN to an actual research problem. There are innumerable other examples that one could create (of which you and another reviewer mention a number). Our hope is that other people who use the code will generously donate their use cases and we will have a whole library of examples.

We did add an extra example to the online documentation involving metal centers based on the reference you provided, and noted in the main manuscript that other examples are available online.

3. In the atoms.py __init__ function (and several other places), there are variables instantiated with lists. This is a dangerous practice in Python that can lead to unexpected behavior, especially once others start contributing to the code. All instances where a default argument value is [] or {} should be replaced by `None`, which can then be checked in the code and replaced with the desired default list or dict. See “Default parameter values are evaluated…” in the Python docs here for context: https://docs.python.org/3/reference/compound_stmts.html#function-definitions.

Thank you for bringing this up–somehow in my 10 years of Python experience, I’ve never run into this! This is a such a non-intuitive side-effect. I’ve changed this everywhere in the code as you suggested.

4. In uff4mof.py, the “Pt4+2” key is present twice with two different values such that only the latter is actually used. I think everything is fine here because the latter is from the UFF4MOF paper, but it might be worth simply commenting out the first instance so that a user doesn't get confused when quickly glancing at the code (even though this part of the code was adopted from a different codebase).

Fixed. Thank you.

5. Does mofun.atoms.Atoms.update_fields() do anything? It looks like the labels are simply iterated through, and `pass` is the only action applied at each iteration of the loop.

We have deleted this method, which was unused. Thank you for noticing!

6. As suggested in the guideline for authors, it would be worthwhile (although not required) to consider archiving the code in a repository that issues a DOI in case something should ever happen to the currently cited URL. The easiest way to do this would be to link the current GitHub repo with a Zenodo account, as noted here: https://guides.github.com/activities/citable-code. If a DOI is created for the MOFUN package, it should be added to the GitHub repo and the main text.

We have added a Zenodo DOI and updated the manuscript and the GitHub readme to refer to it. Thank you for the suggestion!

Referee: 2

Comments to the Author
Review comments:
Dear Authors,
I read your manuscript with great interest. I think MOFUN can be a useful tool for the chemistry community. However, I have some concerns regarding the presentations, use cases and applicability of MOFUN which I believe have not been appropriately addressed. Currently the manuscript reads more like a well-written user manual and needs more work to make it more useful for a wide range of scientific community members.
I am requesting a major revision of the manuscript so that the authors can address my concerns.

Thank you for taking the time to referee our manuscript.

1. The authors have added a mention of FunctionalizeThis and PoreMatMod.jl.Below I am listing some publications and resources that should be referenced as well:

 Ring-replacement recommender: https://peter-ertl.com/molecular/substituents/ringreplacement.html#
 J. G. Sobez and M. Reiher, Molassembler: Molecular graph construction, modification, and conformer generation for inorganic and organic molecules, J. Chem. Inf. Model., 2020, 60, 3884–3900
 E. I. Ioannidis, T. Z. H. Gani and H. J. Kulik, molSimplify: A toolkit for automating discovery in inorganic chemistry, J. Comput. Chem., 2016, 37(22), 2106–2117
 V. M. Ingman, A. J. Schaefer, L. R. Andreola and S. E. Wheeler, Qchasm: Quantum chemistry automation and structure manipulation, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2020, 11(4), e1510.
 L. Turcani, A. Tarzia, F. T. Szczypiński and K. E. Jelfs, stk: An extendable python framework for automated molecular and supramolecular structure assembly and discovery, J. Chem. Phys., 2021, 154(21), 214102.
 A. V. Kalikadien, E. A. Pidko and V. Sinha, ChemSpaX: exploration of chemical space by automated functionalization of molecular scaffold, Digital Discovery, 2022, 1, 8-25.

Thank you for these interesting references! We have extended the introduction to include the applicable references you have included and to also briefly discuss SMILES / SMARTS searching.

2. The diversity of examples is extremely limited which is in contrast with what the authors say in the first line of this manuscript in the Introduction section
Quoting
“ MOFUN is a general purpose, open-source Python package for searching an arbitrary
molecular structure for a pattern and replacing …. ”
Please demonstrate use cases on other systems as well for example: molecular cages, transition metal complexes, biomolecular systems, periodic systems involving adsorbates on metal surfaces etc. If there are systems where MOFUN cannot be used, they should be mentioned here as well.

Although adding more examples would be helpful in some ways, it would also make the paper longer. We would rather have additional examples in online documentation that can be regularly updated while reserving just a few in the manuscript to be illustrative of the main features of the method. There are many other examples that one could create (of which you and another reviewer mention a number). Our hope is that other people who use the code will generously donate their use cases and we will have a whole library of examples.

3. In example 1 the authors created the linker file by copying it manually using VESTA software. Can linker files be automatically created, or can the user deploy a pre-made library?

Any means of creating a search or replacement pattern may be used as long as atom positions and elements are defined and can be read into MOFUN using a supported file format (or defined explicitly in Python).

4. A related question to point 3: let us say I want to find a pattern of type “C-XYZ”. You can think of CH3 unit, X=Y=Z=H, but it can be different as well, for example for a chiral molecule. Does my search pattern need to have the same bond C-X, C-Y, C-Z bond length and angles e.g. X-C-Y as the one actually present in the molecule? In the example of matching “CH3” in an octane molecule, what happens if I supplied a CH3 search pattern with H-C-H angle of 90 degrees, and C-H bond length of 1.5 Angstrom? Would it still match? If not, then I assume that the user needs to always copy the pattern to be matched from the larger molecule which is being searched itself. These points should be addressed and demonstrated with appropriate examples.

Since MOFUN identifies matches by relative positions of atoms, the placement of the atoms in the search pattern and the structure need to be within a tolerance δ. This tolerance can be adjusted to accommodate more variation in the structure from the search pattern. This is useful for handling small random perturbations in the structure but is not necessarily sufficient for handling more extreme variation.

For searching for patterns that have multiple configurations, such as alkane chains, it might be more suitable to use a graph-based search / replace program such as the PoreMatMod.jl, as discussed in the introduction.

We’ve added an additional sentence to the conclusion to mention this limitation:

Since MOFUN identifies patterns using relative positions, a search pattern may not match all expected instances of the pattern in the structure if the positions vary in the structure, for example for longer alkane chains.

5. I feel that the manuscript lack any chemical insight that is derivable from the application of MOFUN. The authors should add some model use cases for example variation in properties such as a gas uptake capacity upon fund and replace.
6. Related to the point made above, the authors need to make a connect to the “Discovery” aspect of their “Digital” solution. I think the tool MOFUN is useful but to me it seems to be a minor automation of some tasks specific to MOFs, although I think that it can be more than that. It would be highly interesting and relevant if the authors can demonstrate an example where MOFUN can be integrated in a MOF screening workflow, or present some data-driven (e.g. machine learning on a database) examples enabled by MOFUN.

We agree that this paper does not attempt to connect to “Discovery” as we are solely presenting software enabling “Discovery” for us and others without attempting to draw out chemical insight or bigger scientific examples. The editor has mentioned that this is an appropriate submission to the journal.

Referee: 3

Comments to the Author
Paul Boone and Chris Wilmer report their open-source Python package, MOFUN, for find-and-replace on MOF structures. The use case of the software is to: given an atomistic MOF model, (1) search for and replace particular substructures with different substructures, (2) add defects by deleting linker(s) or metal cluster(s), and (3) atom-typing for using force fields in molecular simulations.
Though PoreMatMod.jl is similar software, MOFUN and PoreMatMod.jl have distinct use cases as summarized in the introduction. And MOFUN is faster, at least using the UiO-66 search-and-replace task for introducing defects, as a benchmark. Plus, it handles more input files and is written in a more widely used language (Python).
I think the report on/description of the software should be published in Digital Discovery. I provide a few suggestions to help the authors improve their paper, based mostly on adding clarity.

Thank you for recommending publication!

# The major-ish points
- Seems the first condition for a match should be element match since that is less compute time to check, compared to the distances...

This is an astute observation and an earlier version of the code was based on this idea. We created separate lists of each element and then used that as a starting point for matching successive atoms in the pattern. This was faster when we were calculating distances in Python, but when we converted to getting nearby atoms and calculating distances in NumPy and Scipy (which does the actual work in the C layer), iterating through a list of all atoms in the structure with matching elements was much, much slower than getting the nearby atoms first. Still, we do still use an element match prior to the final distance matches for exactly the reason you mention.

Roughly, the current optimized code is this:
We find all nearby atoms to a starting atom. This operation is performed by Numpy on a previously sorted list in the C layer.
We calculate the distances between all atoms using Scipy’s distance routines, which is also highly optimized in the C layer. These distances are calculated once and used for all possible distance matches for the starting atom.
Then, when evaluating each potential atom as a match for the next atom in the search pattern, we first (A) check the element matches, then (B) check all possible distances from the new atom to the prior atoms.

We’ve added a new figure (Figure 2 in the revised manuscript) that gives an overview of this process, and we’ve added a couple sentences to the optimization section:

The calculation for identifying nearby structure atoms is implemented as a filter on a presorted NumPy array of the coordinates and is therefore executed as highly-optimized compiled C code. The distances between all nearby atoms are precalculated together using the SciPy library, which is again executed as highly optimized C code.

- Selecting only three points on the search pattern for the alignment makes the procedure, possibly, sensitive to which points you choose. Why not use them all, with a least squares approach? Especially if you choose the three points to be near each other, the result may be sensitive to small perturbations. More computationally efficient though to choose three points. That you picked the two points furthest from each other is great, since this will prevent the sensitivity to small changes in position I allude to. For the example in Fig. 1, "directional axis is the axis from C to H" contradicts this, though, doesn't it? Because the furthest atoms would be H to H that 'span' the fragment.

This is a great idea that is worth looking into. Also, it could be possible to use a similar approach towards matching patterns in addition to alignment. This would definitely be more computationally expensive, but I can see it giving better results for certain systems.

You are correct that we were defining an arbitrary directional axis in Figure 1 which was possibly not the best way of arranging that figure. We have completely redone Figure 1 and the new Figure 1 uses the default longest directional axis, which is one of the H-H pairs, as you correctly bring up.

- A figure/diagram to illustrate your search procedure would be really helpful for understanding the implementation and how it is so fast. It seems to be an instance of depth-first search on a graph. Where each vertex of the graph gives a pair of atoms that are proposed to match and a path down the graph gives a match after you get to the end. You walk down the search tree and reject at any vertex that tells you it is not a viable match, then go back a step, up the graph. (Depth-first search)

At a very high-level you can think of our algorithm as depth-first per starting atom, but breadth-first for building out matches on that starting atom. We have created a new figure (Figure 2 in the revised manuscript) that hopefully helps give readers an overview of some of the optimizations we’ve employed.

- The replacement pattern must lie in the same coordinate system as the search pattern. Is this a limiting disadvantage in some way? I am imagining the replacement pattern being relaxed by DFT, and say the rings are different and the coordinates shifted a bit. Will your code still be able to make a replacement where DFT has relaxed the replacement pattern?

Yes, the code will be able to do the replacement, though you will have to be careful with where the replacement pattern connects to an existing structure. You can imagine the relaxation moving the ends of a pattern by say 1.5Å and when you replace the pattern the ends of the pattern overlap with the atoms they are supposed to be bonded to. If you were to constrain the positions of the atoms that connect to the structure and relax everything else, there would be no problem.

- Tests: is the software well-tested, automatically, instead of just "looking" at the structures? I think adding some automatic unit tests of the functions and having them run on Github actions will make the software more trustworthy and help with future development by catching bugs when they are introduced.

Yes, the software is tested and the tests are automatically run using Github actions. These were highly helpful and necessary when we were optimizing for speed and debugging all the symmetry edge conditions. Every edge case we ran into while developing the software is tested to prevent regressions.

# Adding clarity
- "structure"= a set of atoms" is not precise. the structures seems to include their positions as well.
- Similarly, "search pattern" is not clearly defined.
- "match" is not well-/precisely-defined either. I suspect this is a one-to-one list of atom correspondences between the structure and the search pattern (a mapping), such that the atom types match and so do the interatomic distances, up to a threshold.

We have updated the manuscript to clarify these terms:

We use the word structure here to mean a set of atoms with elements and positions along with (optional) periodic boundaries that we want to exhaustively examine for instances of a search pattern, which is also a set of atoms with elements and positions. When we find an instance of the search pattern in the structure (a set of atoms in the structure whose relative positions are the same as the relative positions in the search pattern), we call this a match. Every match found can be replaced with a replacement pattern, which is another set of atoms with elements, positions, and optional charges and force-field terms.

- When you calculate distances between each atom in the structure/pattern, you mean "periodic distance", since you use the nearest image convention? Otherwise, not all matches would be found. Later, I see how you replicate the structure, but I was wondering this.

You can think of it as a periodic distance, though since we are searching an expanded structure that includes positions in the neighboring images “unwound” to the coordinate system of the primary unit cell, it can also be considered just a distance on the expanded non-periodic structure.

- The r_m = (r_s N) notation needs clear definition. Is this standard? I know combinations, but here the top is a set of vectors and the bottom is an integer (but written as bold for some reason). I've never seen a set or list of vectors in this notation, so it may need explaining. I suspect this means, "all possible one-to-one mappings between atoms of the structure to atoms of the pattern", but I don't think this notation is clear. Or is it an ordered list?

We have replaced that notation–which you correctly note was imprecise–with explanatory text:

Let r_mbe an N-length list containing N positions from r_s that we will examine as a trial match in the structure.

- Based on your equation, there is a well-defined ordering of the coordinates, so this is not a set anymore, but rather a list, and the "match" is a one-to-one map from atoms on the pattern to atoms on the structure.

Thank you for noticing this. In all cases, we have lists not sets, and we have updated the word everywhere it was used.

- The third condition, that a rotation/translation can be applied. To me, this intuitively seemed to be guaranteed by the first condition of interatomic distances matching. At least, for molecules that have no chirality. Do you agree with this? Are there any instances where the interatomic distances match, but a rotation/translation cannot be found? It seems that the way you have it set up, one cannot distinguish between a match and an alignment failure (depends on how robust your alignment algorithm is).

We originally thought this was guaranteed by the first condition as well, and it was only when we were seeing some very strange results that we realized the first condition was not sufficient. The two cases where this is not true are A) chirality, and (B) symmetry.

Figure 1 and the supported text was supposed to explain why this is the case, but was clearly not very effective. We have completely redone Figure 1 and updated the supporting text in the manuscript so hopefully this is more clear now.

- How do you determine the optimal translation and rotation? An optimization routine of some sort, or rather is there an analytical formula you derived for this? Please precisely clarify. Rotation matrix or quaternions?

We use the scipy.spatial.transform.Rotation class and quaternions. The full code can be seen here:

https://github.com/WilmerLab/mofun/blob/main/mofun/helpers.py

See:
quaternion_from_two_vectors()
quaternion_from_two_vectors_around_axix()

We don’t attempt to determine an optimal translation and rotation: presently, we just align the directional axis and rotate around the directional axis to align the orientation point in the right direction. For cases where the search pattern positions precisely match the trial match positions without error (i.e. the match is found with a tolerance δ close to zero), there will be little to no error in the end result. When a larger tolerance is necessary to match a pattern, it is possible for the replacement pattern to be aligned sub-optimally. This is mitigated by two factors:
having a low tolerance δ.
using a flexible force field and relaxing the system after the replace.
For our current tests and ongoing use cases, we have not run into alignment problems from lack of optimization being an issue for a replacement operation. With that said, if we were to start seeing this problem, we would consider implementing a more robust alignment algorithm based on least squares, like you have suggested above.

We have added the following clarification to the paragraph on alignment:

This alignment procedure is fast as it is precisely defined and doesn’t require an optimization, but for structures where the match positions do not precisely match the search pattern (i.e. finding a pattern requires a tolerance δ), the resulting replacement atoms may be aligned sub-optimally proportional to the required tolerance.

- Fig. 1c, shouldn't that have a " not OK" in the top right, too? I am confused because you show the overlay of the pattern in panel D but not panel C? In panel C shouldn't I see an H atom pointing at me, erroneously? I'm trying to align the search pattern in A to the structure to imagine what it would look like if the orange and black axes were aligned. Maybe I misunderstand.

We apologize for Figure 1! Thank you for all the feedback here and explaining how figure 1 was confusing (in multiple ways). We completely redid the figure with all your feedback and comments in mind so hopefully this is more clear now.

- Randomly picking of one of the good matches: Shouldn't you pick which one has the lowest alignment error? I suppose this is fine too, as you define "good" beforehand so say there is no difference.

With our current code, a match is binary: it is either a match or it isn’t. We’ve mainly applied this for adding defects to structures, which we wanted to be fully random. In this case, using some kind of match quality would make this procedure less random, which would sabotage our objective.

We’d be interested to hear about other applications where this might be helpful.

- Nice to see the scaling of time with size of system.
- What is the difference between "overridden" vs "deleted and replaced"?

Overriding leaves in place existing potentials that may not be defined in the replacement pattern, but any potentials that are defined in the replacement pattern override those in the structure.

We’ve updated the sentence in the paper to be more precise:

if any of the atoms are shared between both the search pattern and the replace pattern (i.e. if the atoms share the same element and position), then any force field potentials defined on these atoms are overridden, rather than all existing terms being deleted and replaced by what’s in the replacement pattern.

- Seems the force field atom typing could be useful for assigning charges too, if you wished say for aromatic and aliphatic carbons to have different charges. Can the code handle partial charges too?

Yes, replacement patterns can have partial charges. We’ve added a couple clarifying sentences, under usage details:

Structures and patterns must include coordinates and elements and can optionally contain periodic boundaries, charges and other metadata… When using a CIF file format, all extra columns attached to an atoms collection as well as bond, angle, and torsion geometry collections are supported when doing a replace.

# Suggestions for code
- I suggest Jupyter Notebooks with examples in the Github repository that can be easily fired up and which can easily read in the files from the relative directory. Not fun to collect files for an example, when you could launch a Jupyter Notebook from a directory with the needed files!

This is a good idea and something we would like to add in the future.

- Possible to use different thresholds on matching criteria for different atom-atom distances?

This is an interesting idea that we had not considered, but I can see this helping to more reasonably match flexible patterns. Currently there is one blanket tolerance that applies to all atom pairs.

- When replace_fraction < 1.0, replacements are chosen at random. Is it possible to specify exactly which instances to replace?

Not at the moment. Currently, the find operation returns all possible matches and can include the quaternions, so it is almost possible to do this manually, but this wouldn’t handle everything. I think adding a feature to do a targeted replacement could be really helpful for some applications. I’d be really interested to hear about what use cases this would enable!

- Can the code simply return the list of all possible matches? This might be useful.

Yes, the code can return all found matches (basically the find without the replace). This can return the quaternions used to rotate the search pattern to align with the match pattern.

We’ve added a sentence to this effect under MOFUN: Usage Details:

One can also run a find operation without replacing anything; in this case, in addition to returning what atoms were matched, the quaternions necessary to rotate the search pattern into place are also returned.

# Praise
- Great that you registered with pip for easy installation.
- Nice piece of software!

Again, thank you for the detailed feedback, and suggestions!

************
FILES TO PROVIDE WITH YOUR REVISED MANUSCRIPT :
IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission. Please carefully check the spelling and format of all author names, affiliations and funding information. If your paper is accepted for publication, it is important this information is accurate to ensure your article is correctly indexed, which may affect citations and future funding evaluation. Please note that if you have selected Accepted Manuscript publication, the author list will appear as provided in the ScholarOne submission details until your Advance Article is published and this information is updated from your article.

• A point-by-point response to the comments made by the reviewer(s). (If selecting transparent peer review, please copy your full response to reviewers into the “Your Response” text box provided.)
• Your revised manuscript with any changes clearly marked (.doc(x) or.pdf file)
• Your revised manuscript as a .doc(x) file including figures, without highlighting, track changes, etc. (If providing in TeX format instead, please also provide a final PDF version including figures). Please note that we cannot proceed with publication using a .pdf file only.

Round 2

Revised manuscript submitted on 08 Aug 2022

Editor’s decision letter

09-Aug-2022

Dear Dr Wilmer:

Manuscript ID: DD-ART-05-2022-000044.R1
TITLE: MOFUN: a Python package for molecular find and replace

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery

Reviewer comments

Reviewer 1

I thank the authors for addressing my comments. I have no further comments or suggestions and recommend the manuscript for acceptance.

Reviewer 3

The authors have taken my comments seriously and addressed them. Nice software and paper!

Reviewer 2

Dear authors, I am really glad that you took the effort to write and publish MOFUN, a tool that was initially developed for your internal workflows.
from the main text: "We built MOFUN, initially, to support our own investigations..."
Such efforts are very helpful in empowering the scientific community at a large scale but often remain underappreciated. I look forward to integrating MOFUN into my own materials research workflows. I am very hopeful that the publication of MOFUN in the Digital Discovery journal will help its wider adoption by the community, and further development.
Best regards
Best regards

Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.