Low dimensional representations along intrinsic reaction coordinates and molecular dynamics trajectories using interatomic distance matrices

Principal Component Analysis on a series of molecular geometries (e.g., a reaction coordinate or trajectory) provides maximum structural variance in the fewest dimensions, and so can offer an objective, comprehensible depiction of the transformation.


Introduction
Chemical reaction pathways and structural transformations occurring on hyperdimensional potential energy surfaces (PESs) can be difficult to comprehend due to the high number of degrees of freedom available in most molecular systems. The use of reaction coordinate diagrams and reduced dimensional potential energy surface scans 1 (RDPESs) has already demonstrated the utility of viewing chemical reactions in a small number of dimensions. These approximate RDPESs are oen made by incrementally varying a small number of geometric features and plotting the values of potential energy as a function of these features to generate a low-dimensional surface. For example, a recent paper by Liu et al. details a method of using RDPESs on which to conduct ab initio molecular dynamics (MD) simulations where the RDPESs were constructed using geometric coordinates "chosen based on the chemical knowledge of the system." 2 In addition to generating RDPESs, similar approaches (e.g., choosing specic bond distances, angles, and dihedrals along the course of trajectories as in ref. [3][4][5][6][7][8] are oen used to plot several MD trajectories, and to carry out free energy sampling [e.g., using methods like umbrella sampling, 9 metadynamics, 10 boxed molecular dynamics (BXD), 11,12 forward ux sampling, 13 milestoning, 14 all of which require a well-dened reduced dimensional space of collective variables from which to sample]. In general, these sorts of analyses tend to rely heavily on user input, i.e., the person making the surface uses their chemical intuition to pick geometric criteria that will make the analysis useful. However, by inferring the geometric changes most important to a reaction and calculating the energy of structures along those coordinates, one runs the risk of conrming one's own biases, and neglecting potentially important degrees of freedom. In a variety of realms, it is therefore useful to have an automated method for generating low-dimensional representations to describe structural changes along molecular pathways that is quantitatively and a priori derived from the input data.
In this article, we outline a dimensionality reduction method incorporating principal component analysis (PCA). PCA is an extremely popular method in various elds: in experimental biology, PCA is used to determine the effects of different gene expressions. [15][16][17] In analytical chemistry, PCA is central in the development of quantitative structure activity relationship (QSAR) models, of particular utility in the pharmaceutical industry. [18][19][20][21] Perhaps most closely related to this study is the use of PCA in computational biology, to capture essential motions of a protein in MD simulations. [22][23][24] There are, however, still some key limitations of PCA: rst, it is assumed that the relationships between features of the data points are linear. Second, principal components must be orthogonal to one another, so some types of coupled motions may not be well-described (i.e., related to the rst point, motions that are coupled in non-linear relationships). Third, because PCA aims to pick principal components along which the variance of the data is maximized, some shapes of the data distribution can end up being described poorly (e.g., two "bands" of data, or stacked "pancakes" of data points). [25][26][27] Despite these limitations, for the applications described herein, PCA does an excellent job of dening a reduced dimensional space, without losing too much structural information along the chemical pathways examined, and the issue of capturing non-linear motions can be mitigated by adjusting the representations of molecular structures that are input to PCA. Despite its utility and the fact that reaction coordinates of small-molecule systems are not as susceptible as those of larger systems to suffer from the aforementioned limitations, as far as we know, PCA is not commonly utilized for the visualization of small-molecule chemical change.
For computational studies of large biomolecular systems occurring over long timescales, a suitable choice of collective variables is necessary for modelling dynamics, and thus many dimensionality reduction techniques in addition to PCA have been explored in the eld. For example, in the realm of Markov state models, many in the computational community have chosen to employ time-lagged (or time-structure based) independent component analysis (TICA) 28 rather than PCA. TICA aims to maximize the autocorrelation for a given lag time, rather than the variance, and so is better able to resolve slow timescale events, which is better for capturing the slow dynamics of large molecules like enzymes. 26,29 Diffusion maps constitute a dimensionality reduction technique that does not assume the data points to be related linearly, but instead seeks to determine the manifold in which the data live. [30][31][32] For the small-molecule applications discussed below, where we are not considering very large systems occurring over large timescales and particularly because we are focusing on intrinsic reaction coordinates (IRCs) rather than MD trajectories to dene a reduced dimensional space, we chose to use PCA in order to determine the optimal reduced dimensional space for these example systems. The methods described herein are provided in an open-source soware package named PathReducer, which allows the user to decide whether their system is best described by linear combinations of Cartesian coordinates or squared interatomic distances, and also whether they would like these inputs to be mass-weighted prior to processing. The merits of all options as applied to several example systems are discussed in the results section, below.
In this paper, we have three principal goals. The rst is to introduce the application of PCA into the eld of smallmolecule computational chemistry, where its value may not have been as widely recognized as it has been in computational biology. The second is to show the utility of using PCA to analyze and characterize chemical pathway data. In particular, we show that a variant of PCA in which the input data are squared internal distances can have advantages over the version in which Cartesian coordinates are used. Additionally, by using a reduced dimensional space dened by an IRC and projecting MD trajectory data into this space, one can quickly classify the routes taken by trajectories compared to the minimum energy path. The third objective is to provide our code, PathReducer: an easy-to-use code for computational chemists to reduce the dimensionality of their molecular systems.

2.
PathReducer: dimensionality reduction software The methods described below are freely available in an open source Python package named PathReducer, with further details in the ESI. † While there are many dimensionality reduction packages already available in the scikit-learn 33,34 library in Python, the present soware is specically designed to process trajectories of small molecules and generate visualizations thereof. The RMSD Python package, which calculates the RMSD between structures and does alignments using a variety of possible methods, was also utilized in the making of this code for structural alignments using the Kabsch algorithm. 35 A owchart illustrating how PathReducer works is shown in Fig. 1.

Input
PathReducer takes as input the following: (a) A series of molecular geometries (e.g., an IRC, a trajectory, a relaxed potential energy surface scan) in xyz le format; (b) n dim , the number of dimensions for the low-dimensional space (oen two or three dimensions would be most useful for visualization); (c) Whether the user wants PCA analysis to be carried out on mass weighted input coordinates; (d) Optional labels of four atoms surrounding a stereogenic center of the molecule in order to dene chirality (this is only necessary when dening the molecular structures as squared interatomic distance matrices, discussed in more detail below); (e) The representation of the IRC/trajectory upon which the user wants to perform PCA. The user can specify that PCA be performed on the aligned Cartesian coordinates of the structures (keyword "Cartesians") or on the upper triangle of the squared interatomic distance matrices of the structures (keyword "Distances").
The full distance matrix representation is less suitable for very large systems as the size of the representations scales as OðN 2 Þ, with N being the number atoms, whereas the aligned Cartesian coordinate representation scales with OðNÞ. Using internal distances, however, provides a more accurate reduced dimensional representation in fewer dimensions when nonlinear motions (e.g. torsions) are involved in the reaction pathway. Additionally, the output from using interatomic distance matrices as input to PCA is more suitable for use in free energy sampling methods since the representation is rotationally and translationally invariant.

Pre-processing
Both methods have the option to mass-weight the Cartesian coordinates prior to processing by PCA, but mass-weighting must occur aer structural alignment. If the specied input is "Cartesians", the Cartesian coordinates of the structures are represented as 3N-dimensional vectors and aligned using the Kabsch algorithm (step 1C in Fig. 1). 36 If the user chooses to mass-weight, the Cartesian coordinates are at this point transformed according to the following equation: where x is the 3N-dimensional vector containing the massweighted coordinates for a single structure along the IRC/ trajectory, m N is the mass of atom N, and N is the number of atoms in the system (MW step in Fig. 1). If the specied input is "Distances", rather than using the 3N-dimensional aligned Cartesian coordinate vectors to represent each structure along the IRC, each structure is represented as a squared internal distance matrix with each element representing the squared Euclidean distance between an atom pair of the molecule, generating an (N Â N)-dimensional distance matrix for each input structure (step 1D in Fig. 1). Because each interatomic distance matrix is symmetric with its diagonal elements being zero, the upper triangle of each matrix can be attened to a vector of length NðN À 1Þ 2 containing all of the pairwise distances.

Processing
The data processing step (step 2 in Fig. 1) involves performing PCA on the [n Â 3N] or n Â NðN À 1Þ 2 ! -dimensional matrix of structures, n being the number of structures from the input xyz le. Because PCA is well-described in the literature, 25,27 we will only give a brief summary of the method here. PCA takes a set of n observations with p variables (in our case, n structures along an IRC/trajectory with 3N Cartesian coordinates or NðN À 1Þ 2 interatomic distances) and returns an orthogonal basis that maximizes the variance captured by the minimum number of principal components. This transformation is accomplished by a diagonalization of the mean-centered covariance matrix C to generate a new orthogonal coordinate system as follows: where U C is the matrix of eigenvectors, each of which represents a new coordinate that corresponds to a linear combination of the original variables, and L C is the diagonal matrix of the corresponding eigenvalues (l C ) of C. In this case, the principal components are linear combinations of Cartesian coordinates or squared interatomic distances. The corresponding eigenvalues correspond to the proportion of the total variance of the system that is captured by each eigenvector. The amount of variance captured by each eigenvector is contained in the eigenvector's corresponding eigenvalue, l C . What is oen referred to as the "goodness of t" (G.o.F.) or the "variance explained" by the reduced dimensional model corresponds to the sum of the eigenvalues of the number of eigenvectors used in the reduced dimensional space (that is, the fraction of variance captured by the n dim principal components chosen): 37

Reconstruction
The reduced-dimensional IRC/trajectory can then be transformed back into the original, full-dimensional space to reconstruct the effect of individual principal components on the molecular geometries using the following expression (step 3 in Fig. 1):X corresponding to weights of the ith principal component, and X is the mean structure of the original dataset. Similarly, the following expression is used to reconstruct the combined effect of the n dim principal components: where T i:n dim is the [n Â n dim ]-dimensional matrix of structures represented by all n dim principal components and W i:n dim is the ing the weights of the n dim principal components. If using a "Cartesians" input to PCA, this is the last step prior to generating output because the reconstructed structures are in Cartesian space. In the case of using the "Distances" input, the structures that have been transformed into reduced dimensional space at this point are still vectors representing the upper triangle of interatomic distance matrices, and so each row then needs to be converted from squared distances to Cartesian coordinates. 38 These steps represent the most computationally expensive part of the procedure, as a matrix diagonalization Fig. 1 A flowchart indicating how PathReducer works. The blue arrows/boxes represent the procedure used if the user specifies a "Cartesians" input to PCA and the red arrows/boxes represent the path taken with a "Distances" input specified. Black arrows/boxes are parts of the method shared by both input types. must be done for each molecular structure (step 7D in Fig. 1). The reconstruction of Cartesian coordinates from the attened, reduced dimensional distance matrices requires the following: each vector is converted back into a square, symmetric matrix with zeroes along the diagonal (step 4D in Fig. 1). The Gram matrix, G, for each internal distance matrix is then calculated by: where D represents the interatomic distance matrix and d 1 is the rst column of D (step 5D in Fig. 1). An eigenvalue decomposition (EVD) is then conducted on G (step 6D in Fig. 1) as follows: The approximate reconstruction of the Cartesian coordinates is given by the rst three columns of the matrix generated by taking dot product of the eigenvectors and the square root of their corresponding eigenvalues, It should be noted that because the reduced dimensional distance matrix, D, is not a true distance matrix, but rather what is referred to as a "predistance matrix", 39 there will be trailing values in the reconstruction matrix L 1/2 G U T G beyond the rst three columns that are a result of the fact that some structural information is lost by reducing the dimensionality of the system. If D was a true distance matrix, only the rst three columns of L 1/2 G U T G would be nonzero. Additionally, because information about the absolute rotational/reective conguration is also lost in representing each of the structures as internal distance matrices, these structures will be in an arbitrary rotational/reective conguration. For the sake of visualization, the Kabsch algorithm, 36 which determines the optimal rotation matrix to minimize RMSD between pairs of points, is used to align structures along the IRC.
Structures along the reconstructed pathway are reected if the chirality of the structure at a particular point is not consistent with the analogous structure in the original le (step   2 The IRC of the malonaldehyde system. Structures A, B, and C represent reactant, transition state, and product structures, respectively. In this and all similar plots below, purple represents the beginning of an IRC/trajectory and yellow represents the end. Fig. 1). The optional input of four atoms surrounding the stereogenic center are used to determine the chirality of the structure at each point by the method in ref. 40. The sign of the following fourth-grade determinant is used to assign the chirality of the structure:

7D in
where x i , y i , and z i represent the Cartesian coordinates of the four atoms surrounding the stereogenic center. This determinant will only be equal to zero when the four atoms used to assign the molecule's chirality are in the same plane. If coordinates were mass-weighted, mass-weighting of the coordinates is removed according to the following equation (step UMW in Fig. 1 where v PC i is the 3N-dimensional vector containing the Cartesian coordinates for a structure along the reaction pathway in PC i , x j is the jth component of the 3N-dimensional vector containing the mass-weighted coordinates for a single structure along the IRC/trajectory, m N is the mass of atom N, and N is the number of atoms in the system. Finally, structures along the reconstructed pathway are aligned using the Kabsch algorithm (step 8D in Fig. 1).

Output
PathReducer generates a total of (n dim + 1) xyz les from the Cartesian coordinates of the principal components (PCs): the n dim PCs individually transformed into the full-dimensional space, as well as the combination of all n dim PCs transformed Fig. 5 The IRC for the S N 2 system. Structures A, B, and D represent the reactant, transition state, and product structures (after the fluoride ion orbits the system to hydrogen bond with the hydroxyl group), respectively. Structure C represents the structure where PC2 is at a minimum (Fig. 6), which can be thought of as the structure of the system when the fluoride ion has fully dissociated from the methanol, but has not yet come back to hydrogen bond with the hydroxyl group.

Applications to chemical systems
To illustrate the output of PathReducer, we show four examples of systems on which we conduct dimensionality reduction. The rst two, "malonaldehyde" and "S N 2", are prototypical test systems that have been previously used by Tsutsumi et al. to illustrate their dimensionality reduction approach. 41,42 The third is a simple torsional rotation of N 2 O-appended acrylonitrile. The last example is the opening of substituted cyclopropylidene to generate chiral allenes. 43 The results discussed below utilize coordinates that were not mass-weighted. The mass-weighting option is included in case the user wants to dene a reduced dimensional space for which the calculated kinetic energy is not dependent on mass. As we were not interested in calculating kinetic energy in our reduced dimensional space, and because some of the systems below include hydrogen movements along the reaction coordinate that we did not want to be dwarfed by the movements of heavy atoms, we chose not to mass-weight the coordinates prior to PCA. Massweighting does change the results of the dimensionality reduction, as scaling the data on which PCA is conducted changes the reduced dimensional space. In terms of visualization of the pathway in the reduced dimensional space, mass-weighting will give precedence to the movement of heavier atoms; that is, heavier atoms will contribute more to the structural variance along the chemical pathway, which will be reected in the PCs.
For this reason, care should be taken when deciding whether or not it is appropriate to mass-weight the coordinates prior to PCA. Mass-weighting would not be appropriate, for example, when hydrogen movements play a large role in the chemical Fig. 7 Deformation vectors and geometries of structures A-D represented by PC1 and PC2 along the S N 2 IRC. The deformation vectors correspond to the atomic motions necessary for the current structure's geometry to form the following structure's geometry (i.e., structure A going to structure B, structure B going to C, etc.). Relative vector magnitudes within a frame are quantitative, but are only qualitative between frames (i.e., the magnitude of all vectors in a frame were adjusted by the same factor in order to increase clarity) and should be used as illustrative purposes only, as these vectors are dependent on the final alignment of the structures along the PC. For videos of these PCs, see https:// vimeo.com/335614633 for PC1 and https://vimeo.com/335614625 for PC2. pathway. See the ESI † for mass-weighted results for all of the example systems below.

Quantum mechanical methods for generating IRCs and trajectories
Gaussian 09 (ref. 44) was used to generate the example IRCs shown below. The malonaldehyde, S N 2, and cyclopropylidene IRCs were calculated using the MP2 method 45 with the 6-31+G(d,p) basis set. The MD trajectory for the S N 2 and cyclopropylidene bifurcation systems were calculated using the Born-Oppenheimer Molecular Dynamics (BOMD) functionality in Gaussian 09 at the same level of theory as their IRCs. It should be noted that while ab initio quantum chemistry methods were used to generate IRCs and MD trajectories in this case, this analysis is not specic to a particular type of calculation or level of theory. All that is needed as input to the method is one or more les containing molecular structures in xyz le format illustrating the transformation(s) of interest.

Malonaldehyde
Intramolecular hydrogen transfer between the two oxygens of malonaldehyde is one of the most studied systems in reaction dynamics, owing to the fact that the reaction coordinate is symmetric about the transition state structure, generating indistinguishable molecules. The IRC for this reaction, as well as reactant, transition state, and product structures, can be seen in Fig. 2. Fig. 3a and b show the results obtained when PathReducer is used to represent the structures along the malonaldehyde IRC as squared internal distance matrices that are input to PCA. Fig. 3c shows that the rst principal component (PC1) describes 87.0% of the variance, while PC2 accounts for 12.8%. As these components capture more than 99% of the total variance in the geometrical changes along the IRC, we conclude that the important molecular motions are captured by this twodimensional space. Performing PCA on the aligned Cartesian coordinates gives very similar results, which are shown in the ESI. † Fig. 4 shows that the most signicant principal component (PC1) corresponds to motion of the hydrogen atom between the two carbonyl oxygens and alternating single and double bond character of the two C-C bonds. The second most signicant principal component (PC2) corresponds predominantly to inward motion of the carbonyl oxygens, where the oxygens are farthest apart in the reactant and product structures and closest Fig. 9 The two possible pathways of N 2 O reacting with acrylonitrile. The ball-and-stick inset illustrates the dihedral angle rotation of the N 2 Oacrylonitrile complex that appears to be the primary geometric coordinate that differentiates the two transition state structures along the competing pathways. together at the transition state structure. For videos of these PCs, see https://vimeo.com/335614575 for PC1 and https:// vimeo.com/335614565 for PC2. See the ESI † for corresponding xyz les of these PCs.

S N 2 reaction between OH À and CH 3 F
Our second example is the S N 2 reaction between hydroxide ion and uoromethane, where hydroxide ion attacks the backside of uoromethane and releases a uoride ion (Fig. 5). Modelled in the gas phase, along the IRC, the uoride ion does not dissociate completely, but rather orbits the newly generated methanol until it nds a suitable location to hydrogen bond with the hydroxyl group. This is not, however, the most common scenario in MD trajectories. Only 10% of MD trajectories conducted by Tsutsumi et al. showed the uoride ion hydrogen bonding with the resultant methanol, while the other 90% had the uoride dissociating from the system completely. 42 In this system, with squared interatomic distances as input to PCA, PC1 accounted for 78.7% of the variance, PC2 for 14.5%, and PC3 for 4.9% (Fig. 6c, below).
Visualizations of the geometric changes along the top two principal components can be found in Fig. 7. PC1 represents a pathway that looks quite similar to the original IRC, where the uoride ion dissociates from methanol and then orbits around the molecule to interact with the hydroxyl group. For a video of PC1, see https://vimeo.com/335614633. PC2 represents an almost periodic motion (as can be seen in Fig. 6a, where PC2 starts at a maximum, reaches a minimum, and then returns near to the same maximum) of methyl group pyramidalization and O-H bond stretching. For a video of PC2, see https:// vimeo.com/335614625. Corresponding xyz les for these PCs can be found in the ESI. † New data for a system can also be projected into a dened reduced dimensional space. To illustrate this, a MD trajectory was initiated from the S N 2 system's transition state structure (structure B, Fig. 5-7) for 500 steps of 1 fs and propagated in the product direction. As was observed in most of the trajectories calculated by Tsutsumi et al., 42 aer dissociating, the uoride ion did not orbit the resultant methanol and hydrogen bond with the hydroxide group, but rather dissociated completely and did not re-associate for the duration of the trajectory (500 fs). As can be seen in Fig. 8, there is oscillatory movement with the amplitude in the direction of PC1 and almost linear movement in the direction of PC2. This oscillation reects the excess energy in the forming C-O bond vibration (reected in PC1) and progression along PC2 is consistent with the C-F distance increasing. Though this reduced dimensional space was dened only by the structures along the S N 2 IRC, it can be quickly seen from the projection of an MD trajectory in the reduced dimensional space that the dynamical path is very different than the IRC path. In addition to showing that MD trajectory paths can be very different from IRC paths, this example illustrates that PathReducer can be used as a straightforward way to classify reaction pathways generated by different types of molecular simulations. Plots of the results when using aligned Cartesian coordinates to represent the molecular structures can be found in the ESI † and look similar to those generated when using squared interatomic distances as input to PCA.

Torsions in the N 2 O-acrylonitrile complex
One of the biggest issues that was found in this study with using aligned Cartesian coordinates as input to PCA rather than interatomic distances is how poorly non-linear motions (e.g., torsions) are represented in individual principal components. To illustrate this point, we looked at the dihedral rotation around the C-O bond of a N 2 O-acrylonitrile complex. We chose this system as one that could be interesting to view in reduced dimensions because we posit that this rotation would be a geometric feature that could, in principle, discriminate between two possible reactive pathways: epoxidation or 1,3dipolar cycloaddition (Fig. 9). Fig. 10 shows the IRC projected onto the reduced dimensional space. In this case, two principal components are enough to describe over 99% of the variance in the system. However, using interatomic distances to represent the structures as input to PCA resulted in the rst principal component accounting for 93.3% of the variance in the system, whereas an aligned Cartesian coordinates representation of the structures meant the rst principal component only accounted for 82.0% of the variance. This result implies that interatomic distance matrices as input to PCA are better for handling torsions in a smaller number of principal components. Thus, if torsions are suspected to be one of the major types of geometric changes along the course of an IRC or trajectory, using the "Distances" input option is likely a better choice (though, if possible, both methods should be screened).
This point can be illustrated by examining the effects of the top principal components on the geometries along the Fig. 12 A comparison of the bifurcating reactions in (a) the previous study by Carpenter et al. 43 looking at the effects of chiral solvent on enantiomeric induction and (b) the current study. Both reactions involve the ring-opening of cyclopropylidene to generate enantiomeric allenes, but N 2 (in blue) was included as a leaving group in the previous study. acrylonitrile scan. When performing PCA on the aligned Cartesian coordinates, PC1 signicantly compresses the N 2 O moiety during the torsion in order to emulate the effect of a dihedral rotation, while this is not the case when using squared interatomic distances. This is particularly evident in the middle frames shown in Fig. 11a and b. Similarly, a squared interatomic distances representation more accurately preserves the bond distances of the N 2 O moiety (Fig. 11c). See https:// vimeo.com/336110236 for a video of PC1 using aligned Cartesian coordinates as input to PCA and https://vimeo.com/ 335614657 for PC1 using interatomic distances as input to PCA.

Post-transition state bifurcation in cyclopropylidene ring-opening
The nal example to illustrate the utility of this method is a system that exhibits a post-transition state bifurcation. 46,47 This particular system is the ring-opening of cyclopropylidene to generate chiral allenes, which follows up on a reaction previously studied by two of us, investigating the effects of explicit solvent on enantiomeric induction. In the previous study, the concerted, asynchronous transition state structure for the ring-opening event was preceded by N 2 departure from the carbene carbon, as depicted in Fig. 12. 43 The system sans N 2 was chosen to focus on the structural changes along the reaction coordinate of the carbon skeleton (including uorines).
Systems with post-transition state bifurcations occur in cases where a single transition state structure connects a reactant to two separate products, without any intermediate minima or secondary barriers along the downhill path to either product. If one were to take the upper saddle point structure on the PES as the transition state structure and follow the steepest descent path in the reactant and product directions, where two products are related by symmetry (e.g., enantiomers) the steepest descent path on the product side would pass by a valley-ridge inection (VRI) point before reaching a minimum. In the case of unsymmetrical bifurcations, there would not be a VRI, but still an additional exit channel with no intervening minima or barriers to overcome. In either case, the IRC would not illustrate the connection between the saddle point and the second possible minimum, as, mathematically, there can only be one steepest descent path. We chose this system to test as input to PathReducer because bifurcating reactions represent a class of chemical change whose dynamics are oen important, but which have very rarely been visualized using actual structural data and are more oen illustrated on qualitative surfaces that illustrate the location of a VRI. 46,[48][49][50][51][52][53][54] While an IRC calculation necessarily picks a single pathway as the minimum energy path, a "bifurcating" IRC could in this   case be constructed by a 180 torsion about the C 1 -C 3 axis (see Fig. 13) for each structure following the branching point. Note that reecting each point along the IRC aer the point where the paths split would articially change the atom labels and would cause the distance matrices for the pathway to each product to be identical, and thus would not be able to show the paths splitting. To avoid this, we keep the atom labels consistent with those that would be obtained by a torsional rotation. Fig. 14 illustrates that representing structures along symmetric bifurcating reaction paths using interatomic distance matrices does a good job of illustrating the path "splitting" before leading to the two possible products, whose locations are shown by the yellow ends of the paths. The top three principal components account for 77.6%, 11.8%, and 10.0% of the variance in the IRC, respectively. The equivalent plot using the "Cartesians" input to PCA can be found in the ESI. † As with the S N 2 system, MD trajectories for the cyclopropylidene bifurcation were initiated from the transition state structure and propagated in the product direction. Fig. 15 shows these trajectories projected into the reduced dimensional space dened by the bifurcating IRC. The MD trajectories do not follow the IRC path very closely, indicating that dynamic properties of molecules should not be deduced from IRCs alone. Assigning the product made in each case (if a product is even made) is not entirely straightforward, as illustrated in the original trajectory videos (found at https://vimeo.com/336131095 for trajectory A, https://vimeo.com/336131066 for trajectory B, https://vimeo.com/ 336131042 for trajectory C, and https://vimeo.com/336131137 for trajectory D. See the ESI † for corresponding xyz les). However, projecting these trajectories into the reduced dimensional space dened by the IRC enables rapid qualitative insights into the routes taken by any particular trajectory. Fig. 15a shows a trajectory in which the cyclopropyl ring opens but lingers in the bifurcation region without committing to a clear product pathway. Fig. 15b and c show trajectories which are heading toward generating a single product (enantiomer 2). Fig. 15d is rather different: it goes along the pathway toward enantiomer 1 before traversing the region between the two possible products, a consequence of the fact that the trajectories illustrated in Fig. 15 are run in the gas phase at a constant total energy (NVE ensemble). Therefore, once the molecule goes down the potential energy "hill" aer the transition state structure, the molecule has signicant excess energy with nowhere to dissipate, which enables interconversion between different product states through high energy geometries.
Seeing MD trajectories projected into a reduced dimensional space dened by an IRC in this way offers a unique perspective on the utility of IRCs compared to MD simulations. While MD trajectories arguably model real, room temperature reactions more accurately by including the effects of nite energy and temperature, this kinetic energy adds noise to the pathway from reactant to product(s). An IRC, however, shows the minimum energy pathway from reactant to product(s); viewed another way, the IRC is the minimum atomic motion necessary for a transformation. In this sense, the IRC provides a sort of "skeleton" characterizing the transformation of interest, which is very useful to aid in product classication of MD trajectories. Dening a reduced dimensional space based on an IRC and projecting MD trajectories into this space offers a simple and efficient way to characterize the pathways of MD trajectories in a quantitative comparison to the IRC.

Conclusions and future work
In conclusion, we have generated a procedure and written soware for dimensionality reduction of reaction pathways that is generalizable and can handle specic chemical problems (e.g., torsions and bifurcations). For several examples, we were able to show that this method can reduce the dimensionality of a complex chemical system to a much smaller number of dimensions. For all of the applications outlined herein, two or three dimensions was sufficient to reconstruct the reaction pathway without losing too much information about the structural variance. The principal components generated as a result of this dimensionality reduction method are linear combinations of (potentially mass-weighted) aligned Cartesian coordinates or interatomic distances. For the example systems described, the interatomic distances representation of structures was better than aligned Cartesian coordinates to describe non-linear structural movements, such as torsions. In the future, we plan to use this methodology to choose collective variables to be used in free energy sampling workows such as metadynamics or boxed molecular dynamics (BXD). 12 We will also analyze various different types of trajectories [e.g., MD trajectories incorporating explicit solvent, non-adiabatic MD trajectories, gas-surface scattering MD trajectories, usergenerated pathways from interactive molecular dynamics in virtual reality (iMD-VR)]. Finally, we would also like to make the code for this method more efficient in order to be better able to analyze enzyme-substrate systems, as similar methods of describing proteins as internal distance matrices have already been utilized. 23,55 Our hope is that PathReducer will prove useful for mapping out reaction pathways, as an alternative to relying on chemical intuition to determine geometric changes that are most important along an IRC or trajectory. While improvements are ongoing, we are condent in the broad utility of dimensionality reduction of chemical systems and believe it has the potential to form a useful tool for molecular analysis within the whole of the molecular simulation community.

Conflicts of interest
There are no conicts to declare. thanks The Alan Turing Institute under the EPSRC grant EP/ N510129/1. DRG acknowledges funding from the Royal Society as a University Research Fellow, and also from EPSRC grant EP/ M022129/1.