Open Access Article

This Open Access Article is licensed under a

Creative Commons Attribution 3.0 Unported Licence

DOI: 10.1039/C9SC02742D
(Edge Article)
Chem. Sci., 2019, Advance Article

Stephanie R.
Hare
^{ab},
Lars A.
Bratholm
^{ab},
David R.
Glowacki
^{ac} and
Barry K.
Carpenter
*^{d}
^{a}University of Bristol School of Chemistry, Cantock's Close, Bristol, UK BS8 1TS
^{b}University of Bristol School of Mathematics, University Walk, Bristol, UK BS8 1TW
^{c}University of Bristol School of Computer Science, Merchant Venturers Building, Woodland Road, Bristol, UK BS8 1UB
^{d}Cardiff University School of Chemistry, Main Building, Park Place, Cardiff, UK CF10 3AT. E-mail: carpenterb1@cardiff.ac.uk

Received
5th June 2019
, Accepted 23rd August 2019

First published on 18th September 2019

Most chemical transformations (reactions or conformational changes) that are of interest to researchers have many degrees of freedom, usually too many to visualize without reducing the dimensionality of the system to include only the most important atomic motions. In this article, we describe a method of using Principal Component Analysis (PCA) for analyzing a series of molecular geometries (e.g., a reaction pathway or molecular dynamics trajectory) and determining the reduced dimensional space that captures the most structural variance in the fewest dimensions. The software written to carry out this method is called PathReducer, which permits (1) visualizing the geometries in a reduced dimensional space, (2) determining the axes that make up the reduced dimensional space, and (3) projecting the series of geometries into the low-dimensional space for visualization. We investigated two options to represent molecular structures within PathReducer: aligned Cartesian coordinates and matrices of interatomic distances. We found that interatomic distance matrices better captured non-linear motions in a smaller number of dimensions. To demonstrate the utility of PathReducer, we have carried out a number of applications where we have projected molecular dynamics trajectories into a reduced dimensional space defined by an intrinsic reaction coordinate. The visualizations provided by this analysis show that dynamic paths can differ greatly from the minimum energy pathway on a potential energy surface. Viewing intrinsic reaction coordinates and trajectories in this way provides a quick way to gather qualitative information about the pathways trajectories take relative to a minimum energy path. Given that the outputs from PCA are linear combinations of the input molecular structure coordinates (i.e., Cartesian coordinates or interatomic distances), they can be easily transferred to other types of calculations that require the definition of a reduced dimensional space (e.g., biased molecular dynamics simulations).

In this article, we outline a dimensionality reduction method incorporating principal component analysis (PCA). PCA is an extremely popular method in various fields: in experimental biology, PCA is used to determine the effects of different gene expressions.^{15–17} In analytical chemistry, PCA is central in the development of quantitative structure activity relationship (QSAR) models, of particular utility in the pharmaceutical industry.^{18–21} Perhaps most closely related to this study is the use of PCA in computational biology, to capture essential motions of a protein in MD simulations.^{22–24} There are, however, still some key limitations of PCA: first, it is assumed that the relationships between features of the data points are linear. Second, principal components must be orthogonal to one another, so some types of coupled motions may not be well-described (i.e., related to the first point, motions that are coupled in non-linear relationships). Third, because PCA aims to pick principal components along which the variance of the data is maximized, some shapes of the data distribution can end up being described poorly (e.g., two “bands” of data, or stacked “pancakes” of data points).^{25–27} Despite these limitations, for the applications described herein, PCA does an excellent job of defining a reduced dimensional space, without losing too much structural information along the chemical pathways examined, and the issue of capturing non-linear motions can be mitigated by adjusting the representations of molecular structures that are input to PCA. Despite its utility and the fact that reaction coordinates of small-molecule systems are not as susceptible as those of larger systems to suffer from the aforementioned limitations, as far as we know, PCA is not commonly utilized for the visualization of small-molecule chemical change.

For computational studies of large biomolecular systems occurring over long timescales, a suitable choice of collective variables is necessary for modelling dynamics, and thus many dimensionality reduction techniques in addition to PCA have been explored in the field. For example, in the realm of Markov state models, many in the computational community have chosen to employ time-lagged (or time-structure based) independent component analysis (TICA)^{28} rather than PCA. TICA aims to maximize the autocorrelation for a given lag time, rather than the variance, and so is better able to resolve slow timescale events, which is better for capturing the slow dynamics of large molecules like enzymes.^{26,29} Diffusion maps constitute a dimensionality reduction technique that does not assume the data points to be related linearly, but instead seeks to determine the manifold in which the data live.^{30–32} For the small-molecule applications discussed below, where we are not considering very large systems occurring over large timescales and particularly because we are focusing on intrinsic reaction coordinates (IRCs) rather than MD trajectories to define a reduced dimensional space, we chose to use PCA in order to determine the optimal reduced dimensional space for these example systems. The methods described herein are provided in an open-source software package named PathReducer, which allows the user to decide whether their system is best described by linear combinations of Cartesian coordinates or squared interatomic distances, and also whether they would like these inputs to be mass-weighted prior to processing. The merits of all options as applied to several example systems are discussed in the results section, below.

In this paper, we have three principal goals. The first is to introduce the application of PCA into the field of small-molecule computational chemistry, where its value may not have been as widely recognized as it has been in computational biology. The second is to show the utility of using PCA to analyze and characterize chemical pathway data. In particular, we show that a variant of PCA in which the input data are squared internal distances can have advantages over the version in which Cartesian coordinates are used. Additionally, by using a reduced dimensional space defined by an IRC and projecting MD trajectory data into this space, one can quickly classify the routes taken by trajectories compared to the minimum energy path. The third objective is to provide our code, PathReducer: an easy-to-use code for computational chemists to reduce the dimensionality of their molecular systems.

(a) A series of molecular geometries (e.g., an IRC, a trajectory, a relaxed potential energy surface scan) in xyz file format;

(b) n_{dim}, the number of dimensions for the low-dimensional space (often two or three dimensions would be most useful for visualization);

(c) Whether the user wants PCA analysis to be carried out on mass weighted input coordinates;

(d) Optional labels of four atoms surrounding a stereogenic center of the molecule in order to define chirality (this is only necessary when defining the molecular structures as squared interatomic distance matrices, discussed in more detail below);

(e) The representation of the IRC/trajectory upon which the user wants to perform PCA. The user can specify that PCA be performed on the aligned Cartesian coordinates of the structures (keyword “Cartesians”) or on the upper triangle of the squared interatomic distance matrices of the structures (keyword “Distances”).

The full distance matrix representation is less suitable for very large systems as the size of the representations scales as , with N being the number atoms, whereas the aligned Cartesian coordinate representation scales with . Using internal distances, however, provides a more accurate reduced dimensional representation in fewer dimensions when non-linear motions (e.g. torsions) are involved in the reaction pathway. Additionally, the output from using interatomic distance matrices as input to PCA is more suitable for use in free energy sampling methods since the representation is rotationally and translationally invariant.

(1) |

Λ_{C} = U_{C}CU^{T}_{C}, | (2) |

(3) |

= T_{i}·W_{i} + , | (4) |

= T_{i:ndim}·W_{i:ndim} + , | (5) |

If using a “Cartesians” input to PCA, this is the last step prior to generating output because the reconstructed structures are in Cartesian space. In the case of using the “Distances” input, the structures that have been transformed into reduced dimensional space at this point are still vectors representing the upper triangle of interatomic distance matrices, and so each row then needs to be converted from squared distances to Cartesian coordinates.^{38} These steps represent the most computationally expensive part of the procedure, as a matrix diagonalization must be done for each molecular structure (step 7D in Fig. 1). The reconstruction of Cartesian coordinates from the flattened, reduced dimensional distance matrices requires the following: each vector is converted back into a square, symmetric matrix with zeroes along the diagonal (step 4D in Fig. 1). The Gram matrix, G, for each internal distance matrix is then calculated by:

(6) |

Λ_{G} = U_{G}GU^{T}_{G} | (7) |

The approximate reconstruction of the Cartesian coordinates is given by the first three columns of the matrix generated by taking dot product of the eigenvectors and the square root of their corresponding eigenvalues, Λ^{1/2}_{G}U^{T}_{G}. It should be noted that because the reduced dimensional distance matrix, D, is not a true distance matrix, but rather what is referred to as a “predistance matrix”,^{39} there will be trailing values in the reconstruction matrix Λ^{1/2}_{G}U^{T}_{G} beyond the first three columns that are a result of the fact that some structural information is lost by reducing the dimensionality of the system. If D was a true distance matrix, only the first three columns of Λ^{1/2}_{G}U^{T}_{G} would be nonzero. Additionally, because information about the absolute rotational/reflective configuration is also lost in representing each of the structures as internal distance matrices, these structures will be in an arbitrary rotational/reflective configuration. For the sake of visualization, the Kabsch algorithm,^{36} which determines the optimal rotation matrix to minimize RMSD between pairs of points, is used to align structures along the IRC.

Structures along the reconstructed pathway are reflected if the chirality of the structure at a particular point is not consistent with the analogous structure in the original file (step 7D in Fig. 1). The optional input of four atoms surrounding the stereogenic center are used to determine the chirality of the structure at each point by the method in ref. 40. The sign of the following fourth-grade determinant is used to assign the chirality of the structure:

(8) |

If coordinates were mass-weighted, mass-weighting of the coordinates is removed according to the following equation (step UMW in Fig. 1):

(9) |

Fig. 3a and b show the results obtained when PathReducer is used to represent the structures along the malonaldehyde IRC as squared internal distance matrices that are input to PCA. Fig. 3c shows that the first principal component (PC1) describes 87.0% of the variance, while PC2 accounts for 12.8%. As these components capture more than 99% of the total variance in the geometrical changes along the IRC, we conclude that the important molecular motions are captured by this two-dimensional space. Performing PCA on the aligned Cartesian coordinates gives very similar results, which are shown in the ESI.†

Fig. 4 shows that the most significant principal component (PC1) corresponds to motion of the hydrogen atom between the two carbonyl oxygens and alternating single and double bond character of the two C–C bonds. The second most significant principal component (PC2) corresponds predominantly to inward motion of the carbonyl oxygens, where the oxygens are farthest apart in the reactant and product structures and closest together at the transition state structure. For videos of these PCs, see https://vimeo.com/335614575 for PC1 and https://vimeo.com/335614565 for PC2. See the ESI† for corresponding xyz files of these PCs.

Fig. 5 The IRC for the S_{N}2 system. Structures A, B, and D represent the reactant, transition state, and product structures (after the fluoride ion orbits the system to hydrogen bond with the hydroxyl group), respectively. Structure C represents the structure where PC2 is at a minimum (Fig. 6), which can be thought of as the structure of the system when the fluoride ion has fully dissociated from the methanol, but has not yet come back to hydrogen bond with the hydroxyl group. |

In this system, with squared interatomic distances as input to PCA, PC1 accounted for 78.7% of the variance, PC2 for 14.5%, and PC3 for 4.9% (Fig. 6c, below).

Visualizations of the geometric changes along the top two principal components can be found in Fig. 7. PC1 represents a pathway that looks quite similar to the original IRC, where the fluoride ion dissociates from methanol and then orbits around the molecule to interact with the hydroxyl group. For a video of PC1, see https://vimeo.com/335614633. PC2 represents an almost periodic motion (as can be seen in Fig. 6a, where PC2 starts at a maximum, reaches a minimum, and then returns near to the same maximum) of methyl group pyramidalization and O–H bond stretching. For a video of PC2, see https://vimeo.com/335614625. Corresponding xyz files for these PCs can be found in the ESI.†

Fig. 6 Structures from the S_{N}2 IRC shown in Fig. 5 projected into the (a) top two and (b) top three principal components of the S_{N}2 system when using squared interatomic distances as the representations of structures that are input to PCA. The locations of structures A, B, C, and D from Fig. 5 with respect to these principal components are labelled. (c) The proportion of variance described by each principal component. |

New data for a system can also be projected into a defined reduced dimensional space. To illustrate this, a MD trajectory was initiated from the S_{N}2 system's transition state structure (structure B, Fig. 5–7) for 500 steps of 1 fs and propagated in the product direction. As was observed in most of the trajectories calculated by Tsutsumi et al.,^{42} after dissociating, the fluoride ion did not orbit the resultant methanol and hydrogen bond with the hydroxide group, but rather dissociated completely and did not re-associate for the duration of the trajectory (500 fs). As can be seen in Fig. 8, there is oscillatory movement with the amplitude in the direction of PC1 and almost linear movement in the direction of PC2. This oscillation reflects the excess energy in the forming C–O bond vibration (reflected in PC1) and progression along PC2 is consistent with the C–F distance increasing. Though this reduced dimensional space was defined only by the structures along the S_{N}2 IRC, it can be quickly seen from the projection of an MD trajectory in the reduced dimensional space that the dynamical path is very different than the IRC path. In addition to showing that MD trajectory paths can be very different from IRC paths, this example illustrates that PathReducer can be used as a straightforward way to classify reaction pathways generated by different types of molecular simulations. Plots of the results when using aligned Cartesian coordinates to represent the molecular structures can be found in the ESI† and look similar to those generated when using squared interatomic distances as input to PCA.

Fig. 8 MD trajectory for fluoride dissociation projected into the reduced dimensional space defined by the IRC for the S_{N}2 system with respect to (a) the top two principal components and (b) the top three principal components. The IRC, as in previous plots, is shown by the purple to yellow color-mapped line and the trajectory is shown with the same color-mapping, but a black outline. The equivalent plots for PCA on the aligned Cartesian coordinates can be found in the ESI.† |

Fig. 10 shows the IRC projected onto the reduced dimensional space. In this case, two principal components are enough to describe over 99% of the variance in the system. However, using interatomic distances to represent the structures as input to PCA resulted in the first principal component accounting for 93.3% of the variance in the system, whereas an aligned Cartesian coordinates representation of the structures meant the first principal component only accounted for 82.0% of the variance. This result implies that interatomic distance matrices as input to PCA are better for handling torsions in a smaller number of principal components. Thus, if torsions are suspected to be one of the major types of geometric changes along the course of an IRC or trajectory, using the “Distances” input option is likely a better choice (though, if possible, both methods should be screened).

Fig. 10 (a) Dihedral scan geometries from Fig. 9 projected into the top two PCs and (b) the proportion of variance explained by each principal component of the N_{2}O–acrylonitrile complex system using aligned Cartesian coordinates as input to PCA. (c) Dihedral scan geometries from Fig. 9 projected into the top two PCs and (d) the proportion of variance explained by each principal component of the N_{2}O–acrylonitrile complex system using squared interatomic distances as input to PCA. |

This point can be illustrated by examining the effects of the top principal components on the geometries along the acrylonitrile scan. When performing PCA on the aligned Cartesian coordinates, PC1 significantly compresses the N_{2}O moiety during the torsion in order to emulate the effect of a dihedral rotation, while this is not the case when using squared interatomic distances. This is particularly evident in the middle frames shown in Fig. 11a and b. Similarly, a squared interatomic distances representation more accurately preserves the bond distances of the N_{2}O moiety (Fig. 11c). See https://vimeo.com/336110236 for a video of PC1 using aligned Cartesian coordinates as input to PCA and https://vimeo.com/335614657 for PC1 using interatomic distances as input to PCA.

Fig. 12 A comparison of the bifurcating reactions in (a) the previous study by Carpenter et al.^{43} looking at the effects of chiral solvent on enantiomeric induction and (b) the current study. Both reactions involve the ring-opening of cyclopropylidene to generate enantiomeric allenes, but N_{2} (in blue) was included as a leaving group in the previous study. |

Systems with post-transition state bifurcations occur in cases where a single transition state structure connects a reactant to two separate products, without any intermediate minima or secondary barriers along the downhill path to either product. If one were to take the upper saddle point structure on the PES as the transition state structure and follow the steepest descent path in the reactant and product directions, where two products are related by symmetry (e.g., enantiomers) the steepest descent path on the product side would pass by a valley-ridge inflection (VRI) point before reaching a minimum. In the case of unsymmetrical bifurcations, there would not be a VRI, but still an additional exit channel with no intervening minima or barriers to overcome. In either case, the IRC would not illustrate the connection between the saddle point and the second possible minimum, as, mathematically, there can only be one steepest descent path. We chose this system to test as input to PathReducer because bifurcating reactions represent a class of chemical change whose dynamics are often important, but which have very rarely been visualized using actual structural data and are more often illustrated on qualitative surfaces that illustrate the location of a VRI.^{46,48–54}

While an IRC calculation necessarily picks a single pathway as the minimum energy path, a “bifurcating” IRC could in this case be constructed by a 180° torsion about the C_{1}–C_{3} axis (see Fig. 13) for each structure following the branching point. Note that reflecting each point along the IRC after the point where the paths split would artificially change the atom labels and would cause the distance matrices for the pathway to each product to be identical, and thus would not be able to show the paths splitting. To avoid this, we keep the atom labels consistent with those that would be obtained by a torsional rotation.

Fig. 14 illustrates that representing structures along symmetric bifurcating reaction paths using interatomic distance matrices does a good job of illustrating the path “splitting” before leading to the two possible products, whose locations are shown by the yellow ends of the paths. The top three principal components account for 77.6%, 11.8%, and 10.0% of the variance in the IRC, respectively. The equivalent plot using the “Cartesians” input to PCA can be found in the ESI.†

As with the S_{N}2 system, MD trajectories for the cyclopropylidene bifurcation were initiated from the transition state structure and propagated in the product direction. Fig. 15 shows these trajectories projected into the reduced dimensional space defined by the bifurcating IRC. The MD trajectories do not follow the IRC path very closely, indicating that dynamic properties of molecules should not be deduced from IRCs alone. Assigning the product made in each case (if a product is even made) is not entirely straightforward, as illustrated in the original trajectory videos (found at https://vimeo.com/336131095 for trajectory A, https://vimeo.com/336131066 for trajectory B, https://vimeo.com/336131042 for trajectory C, and https://vimeo.com/336131137 for trajectory D. See the ESI† for corresponding xyz files). However, projecting these trajectories into the reduced dimensional space defined by the IRC enables rapid qualitative insights into the routes taken by any particular trajectory. Fig. 15a shows a trajectory in which the cyclopropyl ring opens but lingers in the bifurcation region without committing to a clear product pathway. Fig. 15b and c show trajectories which are heading toward generating a single product (enantiomer 2). Fig. 15d is rather different: it goes along the pathway toward enantiomer 1 before traversing the region between the two possible products, a consequence of the fact that the trajectories illustrated in Fig. 15 are run in the gas phase at a constant total energy (NVE ensemble). Therefore, once the molecule goes down the potential energy “hill” after the transition state structure, the molecule has significant excess energy with nowhere to dissipate, which enables interconversion between different product states through high energy geometries.

Seeing MD trajectories projected into a reduced dimensional space defined by an IRC in this way offers a unique perspective on the utility of IRCs compared to MD simulations. While MD trajectories arguably model real, room temperature reactions more accurately by including the effects of finite energy and temperature, this kinetic energy adds noise to the pathway from reactant to product(s). An IRC, however, shows the minimum energy pathway from reactant to product(s); viewed another way, the IRC is the minimum atomic motion necessary for a transformation. In this sense, the IRC provides a sort of “skeleton” characterizing the transformation of interest, which is very useful to aid in product classification of MD trajectories. Defining a reduced dimensional space based on an IRC and projecting MD trajectories into this space offers a simple and efficient way to characterize the pathways of MD trajectories in a quantitative comparison to the IRC.

- The type of surface being referred to here can be seen in
(
*a*) S. R. Hare and D. J. Tantillo, Chem. Sci., 2017, 8, 1442–1449 RSC; (*b*) S. R. Hare, R. P. Pemberton and D. J. Tantillo, J. Am. Chem. Soc., 2017, 139, 7485–7493 CrossRef CAS PubMed; (*c*) S. R. Hare, A. Li and D. J. Tantillo, Chem. Sci., 2018, 9, 8937–8945 RSC. - C. Liu, C. T. Kelley and E. Jakubikova, J. Phys. Chem. A, 2019, 123, 4543–4554 CrossRef CAS PubMed.
- X. S. Xue, C. S. Jamieson, M. Garcia-Borras, X. Dong, Z. Yang and K. N. Houk, J. Am. Chem. Soc., 2019, 141, 1217–1221 CrossRef CAS PubMed.
- Z. Yang, P. Yu and K. N. Houk, J. Am. Chem. Soc., 2016, 138, 4237–4242 CrossRef CAS PubMed.
- L. Xu, C. E. Doubleday and K. N. Houk, J. Am. Chem. Soc., 2010, 132, 3029–3037 CrossRef CAS PubMed.
- G. Jimenez-Oses, P. Liu, R. A. Matute and K. N. Houk, Angew. Chem., Int. Ed., 2014, 53, 8664–8667 CrossRef CAS PubMed.
- A. Patel, Z. Chen, Z. Yang, O. Gutierrez, H. W. Liu, K. N. Houk and D. A. Singleton, J. Am. Chem. Soc., 2016, 138, 3631–3634 CrossRef CAS PubMed.
- E. L. Noey, Z. Yang, Y. Li, H. Yu, R. N. Richey, J. M. Merritt, D. P. Kjell and K. N. Houk, J. Org. Chem., 2017, 82, 5904–5909 CrossRef CAS PubMed.
- J. Kästner, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2011, 1, 932–942 Search PubMed.
- A. Barducci, M. Bonomi and M. Parrinello, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2011, 1, 826–843 CAS.
- J. Booth, S. Vazquez, E. Martinez-Nunez, A. Marks, J. Rodgers, D. R. Glowacki and D. V. Shalashilin, Philos. Trans. R. Soc., A, 2014, 372, 20130384 CrossRef PubMed.
- M. O'Connor, E. Paci, S. McIntosh-Smith and D. R. Glowacki, Faraday Discuss., 2016, 195, 395–419 RSC.
- R. J. Allen, C. Valeriani and P. Rein Ten Wolde, J. Phys.: Condens. Matter, 2009, 21, 463102 CrossRef PubMed.
- A. K. Faradjian and R. Elber, J. Chem. Phys., 2004, 120, 10880–10889 CrossRef CAS PubMed.
- E. Lotfi and A. Keshavarz, Comput. Biol. Med., 2014, 54, 180–187 CrossRef CAS PubMed.
- M. J. Wongchenko, G. A. McArthur, B. Dreno, J. Larkin, P. A. Ascierto, J. Sosman, L. Andries, M. Kockx, S. D. Hurst, I. Caro, I. Rooney, P. S. Hegde, L. Molinero, H. Yue, I. Chang, L. Amler, Y. Yan and A. Ribas, Clin. Cancer Res., 2017, 23, 5238–5245 CrossRef CAS PubMed.
- S.-L. Wang, M. Li and H. Wang, Using 2D Principal Component Analysis to Reduce Dimensionality of Gene Expression Profiles for Tumor Classification, in Bio-Inspired Computing and Applications, ICIC 2011, ed. D. S. Huang, Y. Gan, P. Premaratne and K. Han, Lecture Notes in Computer Science, Berlin, Heidelberg, 2012, vol. 6840 Search PubMed.
- B. Hemmateenejad, R. Miri and M. Elyasi, J. Theor. Biol., 2012, 305, 37–44 CrossRef CAS PubMed.
- J. B. Vieira, F. S. Braga, C. C. Lobato, C. F. Santos, J. S. Costa, J. A. Bittencourt, D. S. Brasil, J. O. Silva, L. I. Hage-Melim, W. J. Macedo, J. C. Carvalho and C. B. Santos, Molecules, 2014, 19, 10670–10697 CrossRef PubMed.
- C. Yoo and M. Shahlaei, Chem. Biol. Drug Des., 2018, 91, 137–152 CrossRef CAS PubMed.
- M. Shahlaei, A. Madadkar-Sobhani, A. Fassihi, L. Saghaie and E. Arkan, Med. Chem. Res., 2011, 21, 3246–3262 CrossRef.
- A. Amadei, A. B. M. Linssen and H. J. C. Berendsen, Proteins: Struct., Funct., Genet., 1993, 17, 412–425 CrossRef CAS PubMed.
- C. J. Woods, M. Malaisree, N. Pattarapongdilok, P. Sompornpisut, S. Hannongbua and A. J. Mulholland, Biochemistry, 2012, 51, 4364–4375 CrossRef CAS PubMed.
- A. Shkurti, R. Goni, P. Andrio, E. Breitmoser, I. Bethune, M. Orozco and C. A. Laughton, SoftwareX, 2016, 5, 44–50 CrossRef.
- I. T. Jolliffe and J. Cadima, Philos. Trans. R. Soc., A, 2016, 374, 20150202 CrossRef PubMed.
- G. Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis and F. Noe, J. Chem. Phys., 2013, 139, 015102 CrossRef.
- J. Lever, M. Krzywinski and N. Altman, Nat. Methods, 2017, 14, 641–642 CrossRef CAS.
- L. Molgedey and H. G. Schuster, Phys. Rev. Lett., 1994, 72, 3634–3637 CrossRef PubMed.
- Y. Naritomi and S. Fuchigami, J. Chem. Phys., 2011, 134, 065101 CrossRef PubMed.
- R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner and S. W. Zucker, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 7426–7431 CrossRef CAS.
- R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner and S. W. Zucker, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 7432–7437 CrossRef CAS.
- R. R. Coifman and S. Lafon, Appl. Comput. Harmon. Anal., 2006, 21, 5–30 CrossRef.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and É. Duchesnay, J. Mach. Learn. Res., 2011, 12, 2825–2830 Search PubMed.
- L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. Vanderplas, A. Joly, B. Holt and G. Varoquaux, 2013, arXiv: abs/1309.0238v1.
- J. C. Kromann, Calculate Root-Mean-Square Deviation (RMSD) of Two Molecules Using Rotation, https://github.com/charnley/rmsd/commit/cd8af49, accessed May 2019 Search PubMed.
- W. Kabsch, Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr., 1976, 32, 922–923 CrossRef.
- It is worth noting that the term “goodness of fit” might be misleading, as the proportion of variance described by the model is not the only factor that determines what is a well-fitting model. The term is being used here for the sake of consistency with the literature.
- I. Dokmanić, R. Parhizkar, J. Ranieri and M. Vetterli, IEEE Signal Process. Mag., 2015, 32, 12–30 Search PubMed.
- W. Glunt, T. L. Hayden and W.-M. Liu, Bull. Math. Biol., 1991, 53, 769–796 CrossRef CAS.
- T. Cieplak and J. L. Wisniewski, Molecules, 2001, 6, 915–926 CrossRef CAS.
- A recent paper by Tsutsumi et al. presented a dimensionality reduction method using Classical Multidimensional Scaling (CMDS). They employ CMDS on both IRCs and global reaction route maps to generate two-dimensional mappings of reaction pathways. The main difference between CMDS and PCA is the nature of the input to each method: PCA uses numerical variables (sometimes referred to as “features” of the data set) available for the input data elements (with each element being, for example, a set of Cartesian coordinates or the corresponding matrix of squared interatomic distances) whereas CMDS uses distances between data elements (e.g., distances between different sets of Cartesian coordinates). The goal of CMDS is to take a matrix of squared distances between data points and project the points in a lower dimensional space that preserves the distances between those points the best. A procedure like this does provide useful information about relative locations between data points, but does not inherently provide anything beyond a mapping. CMDS is also not typically used to transform the data in reduced dimensions back into the full-dimensional space.
- T. Tsutsumi, Y. Ono, Z. Arai and T. Taketsugu, J. Chem. Theory Comput., 2018, 14, 4263–4270 CrossRef CAS.
- B. K. Carpenter, J. N. Harvey and D. R. Glowacki, Phys. Chem. Chem. Phys., 2015, 17, 8372–8381 RSC.
- M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, O. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski and D. J. Fox, Gaussian 09, Revision D.02, Wallingford, CT, 2009 Search PubMed.
- J. A. Pople, K. Raghavachari, H. B. Schlegel and J. S. Binkley, Int. J. Quantum Chem., 1979, 16, 225–241 CrossRef.
- D. H. Ess, S. E. Wheeler, R. G. Iafe, L. Xu, N. Çelebi-Ölçüm and K. N. Houk, Angew. Chem., Int. Ed., 2008, 47, 7592–7601 CrossRef CAS.
- S. R. Hare and D. J. Tantillo, Pure Appl. Chem., 2017, 89, 679–698 CAS.
- D. M. Birney, Curr. Org. Chem., 2010, 14, 1658–1668 CrossRef CAS.
- X. S. Bogle and D. A. Singleton, Org. Lett., 2012, 14, 2528–2531 CrossRef CAS PubMed.
- P. Collins, B. K. Carpenter, G. S. Ezra and S. Wiggins, J. Chem. Phys., 2013, 139, 154108 CrossRef PubMed.
- S. R. Hare and D. J. Tantillo, Beilstein J. Org. Chem., 2016, 12, 377–390 CrossRef CAS PubMed.
- S. Maeda, Y. Harabuchi, Y. Ono, T. Taketsugu and K. Morokuma, Int. J. Quantum Chem., 2015, 115, 258–269 CrossRef CAS.
- A. N. Sheppard and O. Acevedo, J. Am. Chem. Soc., 2009, 131, 2530–2540 CrossRef CAS.
- M. R. Siebert, P. Manikandan, R. Sun, D. J. Tantillo and W. L. Hase, J. Chem. Theory Comput., 2012, 8, 1212–1222 CrossRef CAS PubMed.
- M. Ernst, F. Sittel and G. Stock, J. Chem. Phys., 2015, 143 Search PubMed.
- T. E. Oliphant, Guide to NumPy, Trelgol Publishing, United States, 2006 Search PubMed.
- J. D. Hunter, Comput. Sci. Eng., 2007, 9, 90–95 Search PubMed.
- W. Humphrey, A. Dalke and K. Schulten, J. Mol. Graphics, 1996, 14, 33–38 CrossRef CAS.

## Footnote |

† Electronic supplementary information (ESI) available: The ESI contains the following input xyz files and PathReducer output xyz files, for mass-weighted and not mass-weighted coordinates and for both “Cartesians” and “Distances” inputs to PCA: (1) input files: (a) malonaldehyde system IRC, (b) S_{N}2 system (i) IRC (ii) MD trajectory, (c) N_{2}O–acrylonitrile system dihedral scan, (d) cyclopropylidene bifurcation system (i) IRC (ii) four MD trajectories (A–D). (2) Output files: (a) malonaldehyde system (i) PC1, PC2, PC3 (ii) PCall (PC1–3 combined), (b) S_{N}2 system (i) PC1, PC2, PC3 for IRC (ii) PCall (PC1–3 combined) for IRC (iii) PC1, PC2, PC3 for MD trajectory (iv) PCall (PC1–3 combined) for MD trajectory, (c) N_{2}O–acrylonitrile system (i) PC1, PC2, PC3 using (ii) PCall (PC1–3 combined), (d) cyclopropylidene bifurcation system (i) PC1, PC2, PC3 for IRC (ii) PCall (PC1–3 combined) for IRC (iii) PC1, PC2, PC3 for four MD trajectories (A–D) (iv) PCall (PC1–3 combined) for four MD trajectories (A–D). Relevant plots for all of these systems and all possible input combinations are also included in the ESI. The input file for doing BOMD simulations in Gaussian 09 is also included. See DOI: 10.1039/c9sc02742d |

This journal is © The Royal Society of Chemistry 2019 |