Paul
Boone
a and
Christopher E.
Wilmer
*abc
aDepartment of Chemical and Petroleum Engineering, University of Pittsburgh, 3700 O'Hara Street, Pittsburgh, Pennsylvania 15261, USA. E-mail: wilmer@pitt.edu
bDepartment of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara Street, Pittsburgh, Pennsylvania 15261, USA
cClinical and Translational Science Institute, University of Pittsburgh, 3700 O'Hara Street, Pittsburgh, Pennsylvania 15261, USA
First published on 23rd August 2022
MOFUN is an open-source Python package that can find and replace molecular substructures in a larger, potentially periodic, system. In the context of molecular simulations, find and replace is a useful operation for adding/swapping functional groups, adding/removing solvent molecules or defect sites, and many other helpful system perturbations. MOFUN can also be used to alter force field terms on certain atoms while leaving the geometry/composition otherwise unchanged. The package is easily automated, which is particularly helpful for preparing input files for large-scale screenings. The package is freely available on GitHub at https://github.com/WilmerLab/mofun.
Prior to the development of MOFUN, numerous packages have been developed that can perform ligand replacement or substituent replacement on smaller non-periodic molecules,3–5 for preparing structures for quantum calculations. Pattern replacement can also be performed on non-periodic molecules represented as human-readable SMILES strings.6 Parts of a molecule can be searched for using a SMARTS7 pattern or even simpler (but much less precisely) by using a regex-based text find and replace on the SMILES string itself.
Before MOFUN, Wilmer8 developed a molecular search-and-replace program called FunctionalizeThis! designed for crystalline structures such as MOFs. FunctionalizeThis! did not support finding and replacing bonds and other force field terms, which limited its use particularly when attempting to generate structures that can be used with flexible force fields. More recently, a free and open-source Julia package named PoreMatMod.jl was reported by Henle et al.9 Like MOFUN, the highly versatile package by Henle et al. can be used for automating crystal structure modifications and was at least partly motivated to facilitate research on hypothetical MOFs. While there is significant overlap in functionality between MOFUN and PoreMatMod.jl, there are also a few key differences. Whereas MOFUN searches for patterns via comparisons of distances between atoms, PoreMatMod.jl analyzes the molecular graph defined by how a structure's atoms are bonded. Both approaches can handle many common use cases but sometimes one approach is more suitable than the other, in terms of what kinds of patterns can be searched for. For example, when using distances between atoms, it is possible to search for patterns that are not bonded, such as a molecule physisorbed to a binding site or matching defects between two different layers of stacked graphene. In contrast, molecular graph searches are much better suited to searching for substructures in a conformation-invariant manner, such as when looking for hydrocarbon chains that can assume varied configurations in space while their molecular graphs stay the same. While PoreMatMod.jl can be applied to many of the same problems as MOFUN, like FunctionalizeThis! it does not support finding and replacing bonds or higher-order force field terms.
In addition to simpler use cases, MOFUN was built to support find and replace of substructures that are fully parameterized to use with flexible force fields for molecular dynamics (MD) simulations (in particular, for LAMMPS10). By releasing MOFUN as open-source and announcing it here, we hope that other researchers will also benefit from this general-purpose package and can use it to accelerate and expand their research.
MOFUN is available on GitHub at https://github.com/WilmerLab/mofun under the open-source MIT license. The version described in this paper is version 1.0.1 (https://doi.org/10.5281/zenodo.6950355). MOFUN can be installed from the GitHub source code, or from PyPI using pip.
To find all instances of a search pattern in a structure, we first calculate the distances between all pairs of atoms in the search pattern. For there to be a match of the search pattern in the structure, there must be a list of atoms in the structure that share both the same distances (within a tolerance), and the same atom elements. Let rp be an N-length list of all positions in the search pattern. Let rs be the list of all atom positions in the structure plus each atom's periodic images from immediately adjacent unit cells. Let rm be an N-length list containing N positions from rs that we will examine as a trial match in the structure. We will refer to specific positions in both rp and rm as rp,i and rm,i, where i ∈ [1, …, N] refers to the ith element in the list.
The trial match rm is a good match if three conditions are met. The first condition is that the distances (or Euclidian norm denoted by ‖…‖) between all pairs of atoms in the trial pattern must match their corresponding pairs in the search pattern within a specified tolerance δ.
∀i, j ∈ [1, …, N], |‖rp,i − rp,j‖ − ‖rm,i − rm,j‖| < δ |
If the distance between any pair of atoms differs from its proposed matching pair by an amount greater than the tolerance, the trial match is not a match. The tolerance can be set higher or lower for cases when a looser or tighter match is appropriate.
The second condition is that the atom elements for the pattern must be the same as the atom elements of the trial match. If we let Ep,i be the ith element of the pattern and Em,i be the ith element for the trial match, then:
∀i ∈ [1, …, N], Ep,i = Em,i |
The third condition is that there must exist rotation and translation operations such that when they are applied to the search pattern, the atom positions of the search pattern match the atom positions in the trial match. This condition is necessary to handle cases of symmetry and chirality in the search pattern. For each trial match, we calculate the translation and rotation operations necessary to transform the search pattern to the location of the trial match and then we exclude any matches where the atom positions of the transformed search pattern are not the same as those in the trial match.
To rotate the search pattern into place, we select three points in the search pattern, two to define a direction axis, and the third to use as an orientation point. MOFUN will pick the two atoms in the search pattern that are farthest from each other to define the direction axis and it will pick the atom that is farthest from the infinite line defined by the direction axis to be the orientation point. If the pattern only contains two atoms, or all the atoms are colinear, then the orientation point can be ignored. Rotating the search pattern to align with the match pattern requires two rotation transformations that are implemented using quaternions: (1) we rotate the search pattern so that the two atoms of the direction axis are pointed in the same direction as those same atoms in the match pattern, and (2) we then rotate the search pattern around the directional axis so that the orientation point is in the correct direction. Since all the atoms should now be offset by the location of the match in space, we can translate the search pattern by the difference in position between any corresponding pair of atoms between the search pattern and the match pattern. This alignment procedure is fast as it is precisely defined and doesn't require an optimization, but for structures where the match positions do not closely match the search pattern (i.e. finding a pattern requires a tolerance δ), the resulting replacement atoms may be aligned sub-optimally proportional to the required tolerance.
To demonstrate the problem posed by symmetry, let us consider searching for the CH3 group at the ends of an octane molecule. We can define the CH3 search pattern by arbitrarily labeling one hydrogen as ‘A’, and then labeling the other hydrogens ‘B’, and ‘C’ clockwise as shown in Fig. 1A. The atom pairs (A, B) and (A, C) are equally far apart so we arbitrarily choose (A, B) to be the directional axis and C to be the orientation point. If we search an octane for this pattern, we will find six possible matches for each actual CH3 in the structure (or 12 matches total), one for each possible ordering of the hydrogens: ABC, ACB, BAC, BCA, CAB, CBA. If we look at the matches that start with the ‘A’ hydrogen – ABC, ACB – there is a clockwise ordering of atoms and a counter-clockwise ordering of atoms (see Fig. 1B). If the atoms in the match pattern are ordered clockwise like the search pattern and if we align the directional axis atoms (A, B) in the search pattern to the same atoms in the trial match and rotate the orientation point C into place (see Fig. 1C) then all three hydrogens and the carbon will be in the correct locations (see Fig. 1D). However, if the match pattern was numbered counter-clockwise (opposite numbering to the search pattern), and if we follow the same process to align (A, B) and rotate C into place, then the carbon will be in the wrong position even though all three hydrogens are correctly located. For this example, three of the six possible trial matches have the same counter-clockwise ordering as the search pattern and are good matches, and three trial matches have clockwise ordering and are not matches. Similarly, a chiral search pattern in a structure will match either enantiomorph since the distances between all atoms are the same regardless of which enantiomorph is found, but only an enantiomorph which matches the chirality of the search pattern will be able to be rotated into position of the match pattern.
Thus, the third condition – that the atom positions of a rotated and translated search pattern must be the same as the atom positions of the trial match – removes the bad matches caused by symmetric and chiral search patterns. Once we have all the possible good matches, if we still have multiple matches for the same group of atoms caused by symmetry, then we randomly pick one of the good matches.
At this stage, inserting the replacement atoms is now straightforward. The replacement pattern is defined on the same coordinate system as the search pattern, so to insert the replacement pattern into the structure at the right position and orientation, we take the transformations we calculated above to transform the search pattern into place and apply them to the replacement pattern. We insert the replacement pattern atoms and topology into the structure, delete the matched atoms and existing topology and our find and replace is complete.
First, depending on the length of the search pattern, we do not search all the atoms in each neighboring unit cell. Searching every atom in a replicated 3 × 3 × 3 expanded unit cell could be prohibitively inefficient for larger structures, so we limit the set of atoms searched to only those within a distance d of one of the boundaries of the original unit cell, where d is the length of the search pattern (see Fig. 2C).
Second, we do not generate all trial matches at once as this would lead to running out of memory for all but the smallest systems; instead, we build up trial matches of the search pattern one atom at a time. The first atom in the search pattern is matched by finding all atoms in the structure that share the same element as the first atom in the search pattern (see Fig. 2D). For each of these starting matching structure atoms, we create a list of nearby structure atoms – those atoms that are within a box x ± d, y ± d, z ± d about the matched atom (see Fig. 2E) – and precalculate the distances between every pair of atoms in this list (see Fig. 2F). The calculation for identifying nearby structure atoms is implemented as a filter on a presorted NumPy11 array of the coordinates and is therefore executed as highly-optimized compiled C code. The distances between all nearby atoms are precalculated as a group using the SciPy12 library, which is again executed as highly optimized C code. To match the second atom, we find all nearby structure atoms that match the element of the second search atom and select only the atoms where the distance from the second atom to the first atom matches that of the search pattern. Continuing through the search pattern one atom at a time, we match all nearby atoms based on their elements, and select only those atoms where the distances between the atom and all prior atoms match the distances in the search pattern. At any stage, if there are no viable matches, then we can abort looking for matches using this starting element. If we reach the last atom in the search pattern and have built up one or more complete matches (see Fig. 2G), then these are added to the list of successful matches. Another advantage of generating matches in this manner is that the algorithm can easily be made to operate in parallel, where each starting structure atom is run on a separate core. We have not found it necessary to implement this yet due to MOFUN's current performance being more-than sufficient for our needs but parallelization is available to us if it becomes necessary.
The performance of the optimized algorithm is O(N2), where N is the number of atoms in the system, and the memory usage is O(M2), where M is the maximum number of atoms within a distance d of any other atom. Practically, this means that system size and CPU speed will limit the kinds of systems that can be run. Our early naïve implementations took over 10 minutes to search for all linkers in a 2 × 2 × 2 replicated UiO-66 unit cell – 3496 atoms, 192 linkers – and outright failed for systems bigger than that due to running out of memory. With the optimized algorithm, we can search for and replace all linkers in an 8 × 8 × 8 replicated UiO-66 unit cell – 221184 atoms and 12
288 linkers – in less than 6 minutes on a single core of an Apple MacBook Pro M1 Pro laptop (see Fig. 3). This size system greatly exceeds our lab's current needs of approximately a 4 × 4 × 4 system with 40k atoms for use in thermal conductivity calculations; however, we did run a larger find and replace on a 20 × 20 × 20 system – 3.5M atoms and 192k linkers – and it completed in 15 hours. Additional optimizations could be implemented if there is a need for a find and replace operation at that scale.
![]() | ||
Fig. 3 Time required to find or find + replace all linkers in a UiO-66 unit cell replicated to 2 × 2 × 2, 4 × 4 × 4, 5 × 5 × 5, 6 × 6 × 6, 7 × 7 × 7, and 8 × 8 × 8. |
One of the more advanced features of MOFUN is that it can manage structure topology and force field parameters for flexible force fields and apply them correctly when inserting a parameterized replacement pattern into a structure. This supports the insertion of two-body pair and bond potentials, three-body angle potentials, and four-body dihedral and improper potentials, as defined by LAMMPS. For each defined potential in the replacement pattern, the atoms that make up the potential and the potential type are inserted into structure alongside the replacement pattern atom positions and types. If the structure already has defined topology, then any topology associated with the search pattern is deleted along with its atoms prior to insertion of the replacement pattern, with one very important exception: if any of the atoms are shared between both the search pattern and the replace pattern (i.e. if the atoms share the same element and position), then any force field potentials defined on these atoms are overridden, rather than all existing terms being deleted and replaced by what's in the replacement pattern. This is necessary to handle parameterizing a structure like the one shown below in example 3, or when using find and replace to override the force-field terms in part of a structure while leaving the structure intact. If the replacement pattern defines the potential parameters (i.e. via a “* coeffs” section in the LAMMPS data file), then the potential parameters will also be carried forward into the resulting structure. While this only supports LAMMPS data files (and direct code in Python) at the moment, this is primarily because there is no standardized file format that is commonly-used to define periodic molecular structures along with full topological data and force-field parameters. For LAMMPS users such as ourselves, writing to a LAMMPS data file is extremely convenient as we can immediately simulate systems after find and replace; for users of other simulation packages, the LAMMPS data file should contain sufficient information to be converted into the file formats required by other simulation packages.
When we are functionalizing a MOF using find and replace, we are typically replacing a pattern that has fewer atoms with a pattern that has more, and the larger the functional group is in the replacement pattern, the more likely that functional groups from different linkers will overlap. This may not be a problem, for example, when adding hydroxyl groups to the linker in UiO-66, but if one were to add more bulky functional groups overlap would likely occur. The replacement operation inserts the functional group exactly as specified, and the resulting structure may need to be relaxed using molecular dynamics for the functional group to find a more reasonable configuration. When we are adding bulky functional groups to a structure, we create a replacement pattern where the functional groups are tightly placed near the linker, as much parallel to the linker as possible, to limit any overlapping with functional groups on other linkers. While this tight configuration may be high in energy, since we then relax the structure using a flexible force field, the functional groups can relax into a lower energy configuration.
For this example, we will introduce defects into UiO-66 by randomly removing 25% of the linkers from the structure. We will first replicate the structure to a 2 × 2 × 2 so it fulfills minimum image conventions, which needs to be done before adding defects so that the defects aren't repeated in the structure. We can reuse the structure and search pattern files from example 1, but we will need to create a replacement pattern from the search pattern where the biphenyl ring is removed and replaced with formate caps where the linker would attach to the metal center (see Fig. 4C). This replacement pattern can be created in a similar manner to that described in example 1.
One tactic to overcome these challenges for structures that can be deconstructed into distinct parts is to assign force-field parameters to the constituent parts of the structure and then use find and replace to apply the force field to the entire structure. In this example, we apply this technique to MOFs: we assign parameters to a MOF's metal center and linker and then replace all unparameterized metal centers and linkers in the full structure with their corresponding parameterized versions.
The patterns for the parameterized linker and the parameterized metal center will need to overlap; every desired force-field term will need to be included fully in at least one of the patterns so some atoms and force-field terms will be defined in both files. For 2-body terms, only the atom that connects the metal center to the linker needs to be shared between the patterns. For 3-body (angle) or 4-body (dihedral/improper) terms there will need to be two or three atoms of overlap, respectively (see Fig. 5). Linker and metal center files can be prepared similarly to example 1 above and parameterized manually, or possibly in an automated manner using the rough UFF parameterizer included in MOFUN (information on the UFF parameterizer is beyond the scope of this paper, but can be found in the documentation online), or with other packages.26 For Python:
This journal is © The Royal Society of Chemistry 2022 |