Yifei
Zhu
^{ab},
Jiawei
Peng
^{bc},
Xu
Kang
^{ab},
Chao
Xu
^{ab} and
Zhenggang
Lan
*^{ab}
^{a}SCNU Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, School of Environment, South China Normal University, Guangzhou 510006, P. R. China. E-mail: zhenggang.lan@m.scnu.edu.cn
^{b}MOE Key Laboratory of Environmental Theoretical Chemistry, South China Normal University, Guangzhou 510006, P. R. China
^{c}School of Chemistry, South China Normal University, Guangzhou 510006, P. R. China
First published on 14th September 2022
The analysis of the leading active molecular motions in the on-the-fly trajectory surface hopping simulation provides the essential information to understand the geometric evolution in nonadiabatic dynamics. When the ring deformation is involved, the identification of the key active coordinates becomes challenging. A “hierarchical” protocol based on the dimensionality reduction and clustering approaches is proposed for the automatic analysis of the ring deformation in the nonadiabatic molecular dynamics. The representative system keto isocytosine is taken as the prototype to illustrate this protocol. The results indicate that the current hierarchical analysis protocol is a powerful way to clearly clarify both the major and minor active molecular motions of the ring distortion in nonadiabatic dynamics.
Within the trajectory-based on-the-fly nonadiabatic dynamics, numerous trajectories must be propagated until the statistical convergence is fulfilled. In principle, the statistical analysis of these trajectories gives important dynamics features, such as the excited-state lifetime, the branching ratio of reaction channels, the dominant molecular motions and so on. Moreover, for a direct view to understand the molecular evolution in the on-the-fly nonadiabatic dynamics, it is extremely crucial to identify a few key active coordinates that drive the photoinduced reactions. However, such analysis is not a simple and trivial task.^{22} In traditional analyses, several approaches were normally employed, for instance directly checking the differences between the hopping and initial geometries with human observation, examining the geometric evolution in the dynamics by eyes, and plotting the relevant internal coordinates as a function of time. The difficulty to perform such traditional analyses dramatically increases when the system size becomes larger, the complicated molecular motions are involved and a large number of trajectories are calculated. Alternatively, the normal mode analysis was used to analyse the geometric evolution of nonadiabatic dynamics.^{23–25} As a powerful tool, such analyses provide valuable physical insights into molecular vibrations when small-amplitude molecular motions are involved.
Actually it is difficult to extract the underlying structural evolution from extremely complex motions in the high-dimensional geometric space. But in the perspective of machine learning,^{26–28} it may be helpful to transform a high-dimensional data set to a low-dimensional reduced one, with the goal of preserving the main information contained in the original data set. Based on this reduced-dimensional space, it is preferable to employ clustering methods and other tools to obtain a transparent view of the data distribution patterns. Within this framework, in principle it is possible to introduce unsupervised machine learning algorithms into the analysis of the molecular dynamics, to effectively and accurately extract important dynamics features from the large amount of data given by trajectory simulations. Along this idea, considerable efforts were devoted to the analysis of the ground-state molecular dynamics results.^{27,29–38}
In the meantime, the analysis of the simulation results of nonadiabatic dynamics with unsupervised learning algorithms is still challenging and only a few efforts were made in recent years.^{39} The diffusion map was used by Virshup et al. to perform the dimensionality reduction analyses of the photoisomerization dynamics with ab initio multiple spawning simulations.^{40} The same dimensionality reduction approach was used by Belyaev et al. to understand the geometric evolution in the TSH nonadiabatic dynamics.^{41} Li et al. employed two closely related dimensionality reduction approaches, classical multidimensional scaling (MDS) and isometric feature mapping (ISOMAP) methods, as well as the density-based spatial clustering of applications with noise (DBSCAN) clustering approach to analyse the geometric evolution in the nonadiabatic dynamics.^{42,43} Principal component analysis (PCA) was also used by Capano et al.^{44} and Peng et al.^{45} to analyse the photophysics of the Cu-complex in the TSH dynamics and the role of the bath motion in the symmetrical quasi-classical dynamics method based on Meyer–Miller mapping Hamiltonian, respectively. The combination of the normal mode and PCA was employed to understand the key motion of the nonadiabatic dynamics by González and coworkers.^{23} In addition, some efforts were also made to employ the unsupervised machine learning algorithms to analyse the nonadiabatic dynamics of solid state systems.^{46,47} Recently, the unsupervised machine learning algorithms were also applied by Choi et al. to understand and propagate the dynamics evolution of open quantum systems.^{48}
The deformation of an aromatic ring widely exists in many nonadiabatic dynamics processes, including the photostability of the DNA bases,^{49–54} the internal conversion of the sunscreen molecules,^{55,56} the “channel-III” nonradiative process of benzene and its simple derivatives,^{57,58}etc. Therefore, it is often necessary to find an appropriate way to characterize the ring deformation. Barbatti et al. once used the Cremer–Pople diagram^{59} to clarify the ring distortion in the TSH dynamics.^{52,60,61} The classification by Boeyens also provided a suitable description of the six-membered ring deformation.^{62} Certainly, it is fascinating to check how to employ the unsupervised machine learning algorithms to analyse the ring distortion. In fact, unsupervised learning methods were conducted by Cersonsky et al. to analyse the ring distortion of aliphatic cyclohexane in the metadynamics.^{63} The analysis of the ring deformation may not be straightforward, because we have to find a suitable way to obtain the balanced description of the major and minor active ring-part and side-group DOFs involved in the complicated molecular dynamics.
In the current work, we are committed to develop a hierarchical protocol based on unsupervised machine learning algorithms for automatically identifying different photoreaction channels and their corresponding critical molecular motions from the on-the-fly TSH dynamics simulations. As we expected, this hierarchical protocol can address the aforementioned difficulties to characterize the ring distortion. Here, the PCA^{27,29,30,44,45,64,65} and two clustering methods (DBSCAN^{66} and agglomerative clustering^{67,68}) were used to perform analyses in the protocol. It is clear that the ring deformation may not be well captured by the Cartesian coordinates since such distortions generally display highly nonlinear features. To avoid the dilemma posed by Cartesian coordinates, we proposed to employ six descriptor sets constructed from different groups of redundant internal coordinates based on chemical meanings. The whole analysis protocol includes two stages. First, to identify the reaction channels, we analysed the DOFs belonging to the ring part and end groups successively by performing the PCA and clustering approaches of hopping geometries in the TSH dynamics in a hierarchical manner. In this step, several disjoint clusters were obtained, and in principle each cluster should represent a reaction channel. Second, to identify the major active coordinates of each channel, we compared the hopping geometries with the corresponding initial ones in each descriptor space by employing the PCA.
In this work, we selected the keto isocytosine (Table 1) as a prototype model to demonstrate the above proposed hierarchical protocol. As a tautomer of the cytosine (one of the DNA bases), the photoinduced processes of the keto isocytosine were widely studied in both experimental and theoretical works.^{69–77} These previous works indicated that several conical intersections (CIs) are involved in the nonadiabatic dynamics, which are governed by different ring deformation patterns.^{75,76} Therefore, the keto isocytosine is an ideal model to verify the performance of our hierarchical protocol in realistic systems. Through the analyses based on the proposed hierarchical protocol, six reaction channels and their main active coordinates relevant to the ring deformation were identified. At the same time, even the minor active molecular motions were noticed. All these channels were correlated with the optimized CIs and the further analysis provides the key physical insight into the nonadiabatic dynamics. This suggests that the current hierarchical protocol provides a powerful way to analyse the ring deformation in the nonadiabatic dynamics. Furthermore, although this work was performed on the basis of the TSH nonadiabatic molecular dynamics simulation, a similar idea can be generalized to understand other types of on-the-fly dynamics simulations.
Molecular structure and atomic labels | ||||
---|---|---|---|---|
Descriptor sets | ||||
D _{ring} | D(4,1,5,3) | D(5,1,4,10) | D(10,2,3,5) | D(3,2,10,4) |
D(2,3,5,1) | D(1,4,10,2) | |||
A _{ring} | A(4,1,5) | A(3,2,10) | A(2,3,5) | A(1,4,10) |
A(1,5,3) | A(2,10,4) | |||
R _{ring} | R(1,4) | R(1,5) | R(2,3) | R(4,10) |
R(3,5) | R(2,10) | |||
D _{eg} | D(6,1,5,3) | D(6,1,4,10) | D(1,4,10,9) | D(3,2,10,9) |
A _{eg} | A(4,1,6) | A(2,10,9) | A(5,1,6) | A(4,10,9) |
R _{eg} | R(1,6) | R(9,10) |
This paper is organized as follows. Section II focuses on theoretical methods, implementation and computational details. Section III provides the applications of the current proposed methods in the analysis of the nonadiabatic dynamics of keto isocytosine. Finally, Section IV gives the conclusion of this work.
Although the Cartesian coordinates are used in the nuclear propagation in the on-the-fly simulation, this set of coordinates is not a good descriptor due to obvious limitations, i.e., the lack of translational and rotational symmetries and so on. Therefore, we need to transform the Cartesian coordinates into other sets of coordinates (commonly referred to as “descriptors” or “fingerprints”), which provide the appropriate description of the geometry evolution in the nonadiabatic dynamics.
The main purpose of this work is to analyse the ring deformation in the on-the-fly TSH nonadiabatic dynamics simulation. It is rather feasible to use internal coordinates to construct descriptors, since the bond lengths, bond angles and dihedral angles provide a rather compact and appropriate characterization of the ring deformation.
In this work, several points must be considered in the practical implementation:
(I) Different internal coordinates span in different numerical ranges; thus, they do not have the same scale. Furthermore, the internal coordinates form an over-completed and non-orthogonal space, in which different DOFs are correlated. For example, a simple out-the-plane motion of an end group may bring changes in both dihedral angles and bond angles. Here, we divided the bond lengths, bond angles and dihedral angles into different groups, and constructed several groups of descriptors.
(II) For each of the above three groups, we also divided all involved coordinates into two subgroups, which are either relevant to the motion of the ring part or the end-group part. Such a division is necessary, as the former and later sets give the reasonable descriptions on the ring distortion and the end-group motion, respectively. In this way, two types of motions are not mixed, and this way largely reduces the analysis difficulty.
(III) The H atoms generally display the large-amplitude motions owing to their lightness. In this case, it is easy to overestimate their contribution to the nonadiabatic dynamics. Because we mainly focus on the ring deformation, many internal coordinates associated with hydrogen-atom motions were not very essential. To avoid emphasizing the H-atom motions too much, we simply do not include them in the analysis.
We performed all analysis based on a special internal coordinate set, that is “redundant internal coordinates”.^{78,79} The basic rules to build these coordinate sets were discussed by previous studies in detail.^{78,79} Here, we only mention them briefly. The redundant internal coordinate set is composed of all pairwise bond distances, the appropriate bond angles and well-defined dihedral bond angles. The construction of the redundant internal coordinates can be realized as follows. First, all atomic distances are examined to determine whether two atoms are bonded to each other according to the clearly defined distance criteria.^{78,79} If yes, these two atoms are considered as bonded and their distance is considered as a valid bond distance. Second, the bond angles are assigned only when two atoms bonded to the same third atom. For example, A(C2,C3,N5) in Table 1 is defined as an appropriate bond angle when C2 is bonded to C3 and C3 is bonded to N5. Third, the valid dihedral angles are assigned to the situation that four atoms are bonded sequentially. For example, D(C1,N4,C10,O9) in Table 1 indicates that C1 is bonded to N4, N4 is bonded to C10 and C10 is bonded to O9. Following the above rules, a well-defined redundant internal coordinate set is obtained. In other words, this over-completed coordinate set does not change when the connectivity of a certain molecule remains unchanged. The chosen redundant internal coordinates are employed as default by the Gaussian package.^{79,80} The similar coordinate systems are used in the field of quantum chemistry.^{81–83} More discussions on the chosen redundant internal coordinates are given in the ESI.‡
In this work, the redundant internal coordinates are divided into either “ring” or “eg” group. If all related atoms are included in the ring, this internal coordinate is assigned to be in the “ring” group. For an internal coordinate, if some involved atoms are in the ring while others do not directly belong to the conjugated ring, it is labelled as the “eg”. For example, the molecular ring is formed by C1, N4, C10, C2, C3 and N5 in isocytosine; therefore, A(N4,C1,N5) and D(C1,N4,C10,C2) belong to the “ring” set, while R(O9,C10) and D(N6,C1,N4,C10) are divided into the “eg” group. Since only the ring deformation is mainly concerned, we did not consider the coordinates if all involved atoms are not in the ring moiety.
The current division strategy provides a basic idea rather than decisive division criteria, that is, we can divide the redundant internal coordinate space into different subspace according to different numerical ranges and chemical meanings. For example, in the current division, we may individually examine the different geometric evolution features in the ring parts and side groups connected to the ring moieties. If these two kinds of motions are decoupled or fall in different numerical ranges, the current analysis is very useful. Even if they are coupled, we still can treat them individually first and place them together at the end.
Nevertheless, the PCA can only be used in the linear data distribution. When the data are distributed on a manifold, the nonlinear dimension reduction methods, such as the diffusion map,^{40,84,85} ISOMAP^{27,29,42,86,87} and t-SNE (t-distributed stochastic neighbor embedding),^{88,89} become necessary. However, in many situations, the PCA is still among the first choices due to several advantages. For example, it provides a clear mathematical view of the data set distribution and explicitly generates the reduced coordinates. Thus, the PCA was chosen in the current work.
It is well known that the choice of the clustering approach strongly depends on the analysis purpose and the data distribution profile; thus, it is necessary to choose the suitable clustering method according to the data distribution patterns.^{27,28,90} When the data distribution clearly displays several distinct clusters and a few noise points, the DBSCAN method should be selected.^{91,92} If the data density in each cluster shows a large difference, the agglomerative clustering approach should work better.^{68,93,94} More detailed discussions on these clustering approaches are found in the ESI.‡
1. We performed on-the-fly Tully's TSH simulations. After collecting all geometries at hops, we built the database of the internal coordinate for these structures. We divided all internal coordinates into several descriptor groups and each represents a certain kind of geometric features.
2. We attempted to clarify how many reaction channels exist. Based on each descriptor set, the PCA and clustering analysis were performed for all hopping geometries. The iterative procedure is employed until each of the obtained cluster is non-separable, which should in principle represent a single reaction channel.
3. We attempted to identify the major active coordinates responsible for each nonadiabatic decay channel by comparing the initial structures and the hopping geometries in each channel. Such a comparison was performed based on the PCA.
Note that the above procedure is iterative and hierarchical, and more implementation details are discussed below.
The on-the-fly TSH dynamics simulation with Tully's fewest-switches surface hopping (FSSH) algorithm^{11} was performed at the SA3-CASSCF(12,9)/6-31G* level using the JADE code^{16,76,95} interfaced with the MOLPRO package.^{96} The decoherence correction proposed by Granucci et al.^{18} is introduced on top of the TSH algorithm with the correction parameter C = 0.1 hartree.^{97} All these calculation setups are very similar to our previous work.^{76} More details on the TSH simulations are given in the ESI.‡
Firstly, we considered the DOFs in the ring moiety only and performed the PCA using three descriptor sets (D_{ring}, A_{ring} and A_{ring}). This gives the data distribution patterns in three individual reduced spaces. For each of them, the clustering analysis was then conducted. In some cases, several clusters are obtained while in other cases we only obtain a single cluster. Sometimes, the symmetry property, that is, mirror symmetry, should be taken into account here.
For each individual cluster, we re-performed the PCA and applied the clustering analysis in the reduced space again, still based on the descriptor sets associated with the ring moiety. Such an iterative step was repeated until all clusters are non-separable. Here, if the separable clusters appear in two descriptor sets, we will take the case which gives the more clear cluster boundary. After this step, all hopping geometries were well analysed by the appropriate consideration of the DOFs in the ring moiety.
After the generation of several non-separable clusters by the analysis of the ring motion, we wished to separate the hopping geometries again based on the DOFs in the end groups. For each non-separable cluster obtained above, we performed the PCA again for each individual cluster based on the descriptor sets involving the end groups (D_{eg}, A_{eg} and A_{eg}). After the PCA, the clustering analysis was applied to give some sub-clusters. After some iterations, several non-separable clusters were obtained. In principle, each cluster should correspond to a certain reaction channel.
In the above procedure, both DBSCAN and agglomerative clustering methods were employed according to the data distribution.
Starting from a single cluster, we collected the hopping geometries of this cluster together with their corresponding initial sampling structures. Next, the PCA was performed based on six descriptor sets, and the reduced low-dimensional space provided a direct view of data distribution patterns. Within the reduced spaces constructed from different molecular descriptors, we noticed that there are two types of data distribution patterns. In some situations, the initial geometries overlap with the hopping geometries, implying that the chosen descriptor set does not play the active part in the nonadiabatic dynamics. In other cases, the initial geometries are well separated from the hopping geometries, and at the same time, a few leading components contribute significantly to the variance of the data distribution. This indicates that the corresponding reduced coordinates can well capture the geometric difference between the initial and hopping geometries. In this way, the active coordinates for the chosen channel were identified. After the similar analysis procedure was applied for each reaction channel, all relevant active DOFs were found.
For each non-separable cluster, we attempted two approaches to obtain the representative hopping geometry. In the first one, the typical geometry was defined as the cluster center, which was obtained by the following approach. For a given cluster, we computed all pair-wise distances in the reduced space obtained from the PCA. For a data point, we calculated the summation of all inter-point distances of this point. If a data point gives the minimum of this summation, we chose it as the cluster center. In the second method, we calculated the average values of the internal coordinates for all geometries within the selected cluster to build an averaged molecular structure. These two ideas provide visible ways to capture the major geometric features of the given cluster.
All analysis codes were written with Python language, and some of them were developed based on Scikit-learn Python toolkit,^{99} for example, PCA, DBSCAN and agglomerative clustering.
As shown in Fig. S1 (ESI‡), the population of the S_{2} state disappears very quickly, along with the accumulation of the S_{1} state in the early stage of the dynamics. Subsequently, the trajectories begin to jump back to the S_{0} state, and the population of the ground electronic state increases. Totally, we considered 1000 trajectories, of which 436 trajectories undergo S_{1}–S_{0} “hops” within 1.5 ps. The overall population dynamics is similar to our previous work.^{76}
We collected all geometries at the first S_{1}–S_{0} hop events and constructed the descriptors to characterize their geometric features. The internal coordinates were built using the Gaussian package according to Table 1. Whereas the internal coordinates of seven hopping geometries are inconsistent with those of initial geometries, because the six-member ring breaks in the excited-state dynamics. We just discarded them in the further analysis because of their minor contribution (<2%). In this way, a database (DB) containing all hopping geometries was constructed, in which each geometry is represented by a group of geometric descriptors, i.e., D_{ring}, A_{ring}, R_{ring}, D_{eg}, A_{eg} and R_{eg}, as discussed in Section II.2.3.
Next, we performed the clustering analysis on top of the PCA results of the D_{ring} coordinates, because of the existence of a few clearly separated clusters. As these three clusters have obviously different densities, we chose the agglomerative clustering method as discussed in Section II.1.1. Three clusters were identified, which are labeled as Clusters A′1, A′2 and A′3 in Fig. 1(d). The centers of these three clusters were taken to obtain a rough view of their typical geometric features. We noticed that the chiral symmetry may play some roles between the geometries of Cluster A′1 and Cluster A′3, because both upward and downward D_{ring} puckering deformations may exist in the nonadiabatic dynamics starting from the initial planar structures. In order to eliminate the chiral effect, we simply reversed the signs of the z-coordinates of the geometries in Cluster A′3. After this mirror operation, we rebuilt the database DB and re-performed the PCA and clustering analysis. Now two clusters were obtained, which are labeled as Clusters A1 and A2 in Fig. 1(e).
The next task is to clarify whether each individual cluster (Cluster A1 or A2) is separable by following the protocol in Section II.2.4. Since Cluster A2 contains 31 geometries (<15%), we considered that the Cluster A2 is non-separable. We only re-performed the PCA and clustering analysis for all geometries in Cluster A1 and found such a cluster to be not separable. Therefore, all hopping geometries were well separated into two groups, Clusters A1 and A2, by the PCA and clustering analysis of the ring-moiety motions.
As shown in Fig. 1(g) and (h), only one single data cluster is obtained in the cases when A_{eg} and R_{eg} were considered. In contrast, three clusters with almost identical densities are clearly given in the PCA results based on D_{eg}, beside a few outliers, as shown in Fig. 1(f). And the most important components of RCI and RCII are D(1,4,10,9) and D(3,2,10,9), respectively (see Fig. S6(A1)–(A4), ESI‡).
Next, we performed the further clustering analysis of all geometries in Cluster A1 on top of the D_{eg}-based PCA results. The DBSCAN method was taken here because it works well when we want to separate several clusters with almost identical densities and to remove some outliers, as discussed in Section II.1.1. The clustering analysis gave three clear clusters and 16 noise points (<5%) in Fig. 1(i). Because all three clusters are sub-clusters of Cluster A1, they are reasonably labelled as Clusters A1B1, A1B2 and A1B3. They contain 308, 42 and 32 data points, respectively. To date, we have totally obtained four individual clusters (Cluster A1B1, Cluster A1B2, Cluster A1B3, and Cluster A2). Among them, only the number of hopping geometries in Cluster A1B2 is more than 15% of the total, and thus the other clusters are considered to be non-separable.
To further examine whether Cluster A1B2 is separable or not, we re-performed the PCA and clustering analysis of all hopping geometries in this group again. The results are given in Fig. 1(j)–(l). Among them, the PCA of D_{eg} gives different results. The first leading reduced dimensional coordinate is extremely dominant, more than 70% variance, as shown in Fig. S6(B2) (ESI‡). In the reduced two-dimensional space given by the PCA of D_{eg}, we still obtain three clusters. Although the boundaries of these clusters are not perfectly clear, it is enough to identify these clusters by their different densities. Thus, we attempted to perform the agglomerative clustering here, and three clusters were shown, namely Clusters A1B2C1, A1B2C2 and A1B2C3, in Fig. 1(n). Here, these groups contain 232, 50 and 26 hopping geometries, respectively. Therefore, the later two clusters (Clusters A1B2C2 and A1B2C3) are treated as the non-separable ones in our view.
We re-performed same the PCA and clustering analysis for all geometries in Cluster A1B2C1, while this cluster is no longer separable. Overall, when both the ring-moiety and end-group parts were considered, we finally obtained six groups, i.e., Clusters A1B1, A1B2C1, A1B2C2, A1B2C3, A1B3 and A2.
In the above analysis process, sometimes the unclear clusters may appear in the PCA of some descriptor sets, such as the situation in Fig. 1(b). In such cases, we always examined which descriptor set gives more clear separation of all clusters in the reduced space. If some DOFs show better performance, such as D_{ring} in Fig. 1(a), we will simply take the corresponding descriptor set to conduct the further analysis.
For each cluster, we first collected all hopping structures and their corresponding initial sampling geometries to obtain an extended database. Each data point in the extended database is also represented by six descriptor sets given in Section III.2. Specially, here we took the absolute value of all dihedral angles in D_{eg} to avoid the inherent numerical discontinuity. We performed the PCA based on these descriptor sets, except R_{eg} that only includes two elements (R(9,10) and R(1,6)). All results are shown in Fig. 2–7 and each figure is composed of 24 subplots. Among them, the data distribution in the two-dimensional space spanned by two leading reduced coordinates, the relationship between RCI and a few most important internal coordinates, the variance ratio of each dimension and the components of a few leading reduced dimensions in the PCA are given in subplots (a)–(t). For R_{eg}, we just directly analysed the space spanned by R(9,10) and R(1,6) in the subplot (u). In addition, the S_{0} minimum, the typical hopping structure and the averaged one in the corresponding cluster are shown in subplots (v)–(x) in each figure. In all cases, the latter two geometries give the roughly similar structural features.
By the analysis of the end-group motion, A(2,10,9) and A(4,10,9) play certain roles in the nonadiabatic dynamics. At the same time, D(6,1,4,10), D(6,1,5,3) and D(3,2,10,9) contribute to the separation of the hopping geometries from the initial structures.
In this channel, the puckering of the C1 atom and the puckering at the C10 site in opposite directions are the dominant molecular motion in the nonadiabatic dynamics. Overall, D(4,1,5,3), D(5,1,4,10), A(1,4,10), A(1,5,3), R(1,5), R(2,3) and R(2,10) in the ring part, and D(6,1,4,10), D(6,1,5,3), D(3,2,10,9), A(2,10,9) and A(4,10,9) relevant to end groups are the leading active coordinates for this channel. The variations of D(5,1,4,10), D(4,1,5,3) and several relevant angels indicate that ring puckering takes place near the C4–C1–C5 region. The changes of angles A(2,10,9) and A(4,10,9), bond lengths R(2,10) and others within conjugated part, as well as the relevant dihedral angles, imply the existence of the C10-puckering of the ring part and the out-of-plane motion of the CO bond. These findings are consistent with the geometric features of the typical and averaged hopping structures given in Fig. 2(w) and (x).
Briefly, the main active coordinates in this A1B2C1 channel include D(5,1,4,10) and D(4,1,5,3), A(1,4,10), A(1,5,3) and A(4,1,5), R(1,5), R(2,3) and R(3,5), which indicate the C1-puckering motion, and D(6,1,4,10) and D(6,1,5,3) relevant to the downward out-of-plane motion of the C1–N6 bond.
As a short summary, the dihedral angles D(5,1,4,10) and D(4,1,5,3), the bond angles A(1,4,10), A(1,5,3) and A(4,1,5), and the bond lengths R(1,5), R(2,3) and R(3,5) associated with the ring moiety, D(6,1,5,3), D(6,1,4,10), A(2,10,9) and A(4,10,9), in the end-group part are the major active coordinates of the A1B2C2 cluster. And the dominant motion of this channel is the C1-puckering motion.
Overall, the C1 puckering motion and the out-of-plane motion of the amino group, accompanied by the CN bond stretching motion, are the important motions in this A1B2C3 channel. The dihedral angles D(5,1,4,10), D(4,1,5,3), D(6,1,4,10) and D(6,1,5,3), the angles A(1,4,10), A(2,3,5), A(4,1,5), A(4,1,6), A(4,10,9) and A(2,10,9), and the bond length R(1,6) contributing to these certain motions are important active coordinates.
Thus, the main active coordinates of this channel are the dihedral angles D(5,1,4,10), D(4,1,5,3), D(6,1,4,10), and D(6,1,5,3) and the bond angles A(1,4,10), A(4,1,5) and A(2,3,5). The C1-puckering and the out-of-plane motion of CO contribute mostly to this reaction channel, and these two motions follow the same orientation.
Channel | Important motion | Major active coordinates |
---|---|---|
a Taking the ring moiety as the reference and the direction of the C1-puckering motion as the upward one, [+] indicates an upward motion while [−] denotes a downward one. | ||
C1-puckering | Ring part: | |
C10-puckering | D(5,1,4,0) D(4,1,5,3) A(1,4,10) | |
CO out-of-plane motion[−]^{a} | A(1,5,3) R(1,5) R(2,3) R(2,10) | |
End-group part: | ||
D(6,1,4,10) D(6,1,5,3) D(3,2,10,9) | ||
A(2,10,9) A(4,10,9) | ||
C1-puckering | Ring part: | |
NH_{2} out-of-plane motion[−]^{a} | D(5,1,4,0) D(4,1,5,3) | |
A(1,4,10) A(1,5,3) A(4,1,5) | ||
R(1,5) R(2,3) R(3,5) | ||
End-group part: | ||
D(6,1,4,10) D(6,1,5,3) | ||
C1-puckering | Ring part: | |
D(5,1,4,0) D(4,1,5,3) | ||
A(1,4,10) A(1,5,3) A(4,1,5) | ||
R(1,5) R(2,3) | ||
End-group part: | ||
D(6,1,4,10) D(6,1,5,3) | ||
A(2,10,9) A(4,10,9) | ||
C1-puckering | Ring part: | |
NH_{2} out-of-plane motion[+]^{a} | D(5,1,4,10) D(4,1,5,3) | |
A(4,1,5) A(2,3,5) A(1,4,10) | ||
C1–N6 bond stretching | End-group part: | |
D(6,1,4,10) D(6,1,5,3) D(3,2,10,9) | ||
A(4,1,6) A(4,10,9) A(2,10,8) R(1,6) | ||
C1-puckering | ||
CO out-of-plane motion[+]^{a} | Ring part: | |
D(5,1,4,0) D(4,1,5,3) | ||
A(1,4,10) A(4,1,5) A(2,3,5) | ||
End-group part: | ||
D(6,1,4,10) D(6,1,5,3) | ||
CO stretching motion | Ring part: | |
A(2,10,4) A(1,4,10) | ||
R(2,3) | ||
End-group part: | ||
(4,10,9) R(9,10) |
All clusters (Clusters A1B1, A1B2C1, A1B2C2, A1B2C3 and A1B3) belonging to Cluster A1 play the dominant roles (92.5% of all hops) in the TSH dynamics, which are mainly governed by the C1-puckering motion of the ring moiety. For illustration, we align all representative geometries of these channels by attributing the C1-puckering motion as the upward motion (Table 2), and discuss the other key molecular DOFs. All representative geometries in Cluster A1B1, A1B2 and A1B3 show many similar geometric features.
Here, we noticed that the representative geometries in Cluster A1B1, A1B2 and A1B3 are in fact distinctive by different statuses of the C10 atom in the ring and its associated CO moiety. Their differences are clarified by examining two dihedral angles relevant to the O atom (D(1,4,10,9) and D(3,2,10,9)) (Fig. S6(A1)–(A4), ESI‡) and other relevant internal coordinates. We take the structure of Cluster A1B2 as the reference to illustrate. The downward and upward out-the-plane motions of the CO moiety are observed in the A1B1 and A1B3 decay channels, respectively, while the out-of-plane motion of the CO moiety is almost negligible in the A1B2 channel.
Three sub-clusters (A1B2C1, A1B2C2 and A1B2C3) exist in Cluster A1B2, which are characterized by different combinations of the out-the-plane motions of the NH_{2} group and the pyramidalization motion at the C1 atom. When the pyramidalization motion at the C1 atom is not obvious, we obtained the A1B2C2 channel. When the strong pyramidalization motions are observed, different pyramidalization directions result in two channels, A1B2C1 and A1B2C3. Cluster A2 accounting for 7.5% was also clarified. The CO stretching motion and the relevant ring deformation contribute to this channel, while the aromatic ring remains nearly planar.
In the analysis of the nonadiabatic dynamics reaction mechanism, it is meaningful to clarify the reaction channels and their corresponding major molecular motion. On top of these identifications, we can conduct the further analysis to obtain more physical-chemistry insight behind them.
We chose the typical geometries and five randomly selected hopping structures in each channel as initial guesses, and performed the CI optimization at the SA3-CASSCF(12,9)/6-31G* level with the MOLPRO package. As shown in Fig. 8, three CIs were obtained, which are Ethyl.I, Ethyl.II and CO stretching CIs, consistent with the previous work.^{76} The first two CIs are characterized as the mixture between the ππ* and ground states, and the latter one displays the nπ*/GS character.
Fig. 8 Geometries of three CIs (Ethyl.I, Ethyl.II and CO stretching) of keto isocytosine optimized at the SA3-CASSCF(12,9) level. |
When the initial geometries in the CI optimizations are chosen from the A1B1, A1B2C1 and A1B2C2 channels, the CI optimizations give Ethyl.II CI.^{75,76} If the channels A1B2C3 and A1B3 are chosen, Ethyl.I CI is obtained. These two CIs share some similarities in the presence of the C1 site puckering, but are distinguished by the different orientations of the out-the-plane motion. Here, Ethyl.II CI is preferred because the less NH_{2} torsion is required and no barrier exists.^{76} In addition, we also noticed that different conjugation statuses exist in the ring moieties at these two Ethyl CIs. The bond lengths in the ring moiety show larger changes in the A1B1, A1B2C1 and A1B2C2 channels (Ethyl.II CI) with respect to the S_{0} minimum, compared with the A1B2C3 and A1B3 channels (Ethyl.I CI). In one word, the latter channels experience less conjugation modification in the nonadiabatic decays. Therefore, the PCA results provide the additional possible reasons to explain that the Ethyl.II CI channels are preferred from the perspective of the conjugation alteration of the ring part.
In addition, the CI optimizations starting from the geometries of channel A2 give two CIs, Ethyl.II and CO stretching CIs. As discussed in Section III.3.6, the geometries of channel A2 display quasi-planarity, which is close to the structure of the CO stretching CI. Thus, the energy of this CI may be higher than the Ethyl.II CI. This again gives the additional supports in that the CO stretching CI does not play the important role here. These findings are consistent with the previous work.^{76}
In principle, the CIs are not isolated points but continuous seams in the high dimensional space. The analysis of the role of the whole CI seam is also important. In the current protocol, we clearly demonstrate that different hopping channels may be associated with a single minimum-energy CI. For example, the A1B1, A1B2C1 and A1B2C2 channels are associated with Ethyl.II CI. Therefore, it is clear that different NH_{2} statuses may exist along this seam. More detailed analyses of the topology and branching space of the CI^{102} are given in the ESI.‡
In the previous work with fewer trajectories,^{75,76} aside from the major channel A1B2C1, two minor channels A1B2C3 and A2 were also found. The branching ratio given by the previous work is qualitatively consistent with the current findings, although some minor channels (A1B1 (10.2%), A1B2C2 (12.1%) and A1B3 (7.7%)) were not discussed. One of the possible reasons is that more trajectories and the long-time (1.5 ps) propagation were taken into account in the current work. Furthermore, the traditional analysis of the TSH dynamics may not identify the very minor and detailed distinctive differences of the ring distortion in the geometric evolution.
First, we can choose the redundant internal coordinates as the geometric features to construct the descriptor sets. The redundant internal coordinates have the solid theoretical background.^{78,79} In the current work, since the ring breaks are rare in the nonadiabatic dynamics of the DNA bases as we demonstrated, that is, the current molecular connectivity remains unchanged, we choose the redundant internal coordinate set, and the construction details are well discussed in ref. 78 and 79. Since the chosen redundant internal coordinate system is used by default in the popular Gaussian 16 package, it is rather simple to use this package to generate the redundant internal coordinate set. This definitely improves the applicability of the current approach. If the molecular connectivity changes dramatically, different redundant internal coordinates for different geometries may be obtained. In this situation, we may need to set up a redundant internal coordinate set that is large enough to cover all involved internal coordinates and possible connectivity. Alternatively, it may be necessary to attempt other geometric descriptors such as the machine-learning based geometry descriptors.^{103–106} These are important research topics in the future. In addition, there are various sets of available internal coordinate systems, including the curvilinear natural internal coordinates^{81,82} and the delocalized internal coordinates by Baker et al.^{83} All these coordinate sets are easily generated by different quantum chemistry packages, which are also friendly choices in the similar analyses.
Second, we divided all internal coordinates into different descriptor sets. This division of internal coordinates gave us a more compact and suitable representation of the geometric evolution, and thus allowed us to obtain the better description of the geometric evolution in the relevant subspaces. When we want to analyse the similar systems, we may always follow the current division rules to make the preliminary analysis. In other situations, we may divide all redundant internal coordinates according to other suitable rules. For example, when dealing with other high conjugated systems with two rings, we may divide the redundant internal coordinates into three subgroups, the ring 1 part, the ring 2 part and their connectivity part. If the H atom should be considered, we may set a new subgroup containing the internal coordinates involving the H atom. Therefore, the current division strategy can be easily extended to other molecular systems for the analysis of ring deformation in the nonadiabatic dynamics. In addition, there is no “ground truth” in terms of the selection of geometric feature representations. Different specialized representations should be employed according to problems under study. Since each of these descriptors provides a data representation in the non-orthogonal space, the analysis on them may give different results. In other words, we should choose the problem-specific feature representations and suitable feature subspaces to give the appropriate description of the data distribution patterns. In the current analysis of the ring deformation in the nonadiabatic dynamics, we proposed such a simple division strategy, because that it can give the reasonable geometric features and appropriate subspaces to represent our data. For other similar problems, it is always possible to find the suitable division ways based on the current philosophy.
Thirdly, we performed the dimensionality reduction method first, and the resulting low-dimensional space provides a direct view on the data distributions, which guides the selection of clustering methods and the adjustment of corresponding parameters. Here, although our proposed analysis protocol is based on these “unsupervised” machine learning algorithms, it still requires additional human inventions to select methods and tune parameters.^{27,28} The PCA-then-clustering procedure give us the ideas on the appropriate selection of clustering approaches and corresponding parameters.
Finally, the unsupervised machine learning methods are available from many standard machine-learning libraries, such as Scikit-learn Python toolkit. Therefore, one may easily transfer our current work to analyse the aromatic ring deformation of other similar molecular systems.
Although the current protocol was mainly developed to analyse the Tully's FSSH dynamics simulation results, in principle it can certainly be used to understand the simulations by other trajectory-based or Gaussian-wavepacket based nonadiabatic methods.^{3,10,12,13,107} For example, for the Ehrenfest dynamics, we may directly extract the geometries when the trajectories experience the minimum energy gaps between different electronic states, and use them to compare with the starting geometries. In addition, the similar idea can be used to analyse the results obtained from the ab initio multiple spawning method^{10} by finding the geometries at the spawning events. After different reaction channels are identified, the further analysis can be performed to gain the chemical insight behind each of them.
In practice, we collected the internal coordinates of hopping geometries from the TSH nonadiabatic molecular dynamics simulation. Then, we constructed six descriptor sets (D_{ring}, A_{ring}, R_{ring}, D_{eg}, A_{eg} and R_{eg}). Three of them with a subscript ring only include DOFs in the ring moiety, while the other three contains end-group DOFs. Here, D, A and R denote the dihedral angles, bond angles and bond distances, respectively. Based on these descriptor sets, we hierarchically employed the PCA and clustering methods to analyse the DOFs involving the ring part and end groups successively, until each of the cluster we obtained is non-separable. In principle, each non-separable cluster corresponds to a single reaction channel. After clarifying how many decay channels exit in the current nonadiabatic dynamics, we wanted to identify the major active coordinates and other geometric features responsible for each decay channel. For each, we place the corresponding hopping geometries and their relevant initial structures together, and then performed the PCA with the above six descriptor sets again. If the hopping geometries and the initial ones are well-separated along some leading reduced coordinates, we considered their important components to be the key active coordinates of the corresponding channel.
The nonadiabatic molecular dynamics of the keto isocytosine model was used to examine this hierarchical protocol. Following the above procedure, we totally found six excited-state nonadiabatic decay channels, and their dominant molecular motions were also clarified. The current hierarchical method based on unsupervised machine learning algorithms can capture the major evolution features of the nonadiabatic dynamics of realistic systems, such as the reaction channels, the branching ratios and the corresponding dominant motions. Particularly, this protocol shows the strong ability to characterize both the major and minor active molecular motions and the important features of the ring distortion in detail. Thus, it is a powerful approach to analyse the ring deformation in the trajectory-based nonadiabatic molecular dynamics simulation.
With the development of the computational facilities and the advances of theoretical simulation approaches, the nonadiabatic dynamics under study may involve more and more complicated systems with a huge number of DOFs. In this case, the analysis of the mass amount of high-dimensional data produced by the nonadiabatic molecular dynamics simulations, such as on-the-fly TSH, becomes necessary. In this sense, it is highly preferable to develop the automated analysis protocol for this purpose. Along with this idea, the current work proposed a suitable way to perform the analysis of the ring motion in the nonadiabatic dynamics. In more complicated realistic systems, we expect that the employment of suitable geometric descriptors should be essential for such analyses. Therefore, the application of more advanced molecular descriptors^{103–106} should be rather critical and challenging in the future.
Footnotes |
† The codes of the current work are also available on GitHub (https://github.com/Yifei-Zhu/PCA_ring.git). |
‡ Electronic supplementary information (ESI) available: More details of the theoretical methods and computational details; the time-dependent occupations of electronic states and related discussions; more additional PCA results; the statistical significance analysis of the PCA results; the structural information of the S_{0} minimum, as well as the averaged structure and the typical one of each channel; the optimized structure and the branching space of each CI. See DOI: https://doi.org/10.1039/d2cp03323b |
This journal is © the Owner Societies 2022 |