Supporting Information: Machine learning enables long time scale molecular photodynamics simulations



S1.1 Neural networks (NNs)
Multi-layer feed-forward neural networks (NNs) have been implemented in Python using the numpy1 and theano2 packages.
To find the (hyper)parameters of the optimal NN architecture to best fit the relation between a molecular geometry and its corresponding excited-state properties, we have automated a random grid search3 and adapted the learning rate, one of the most critical parameters during training.4 The optimization of several parameters of the NNs was done for the initial training set consisting of 100 data points for the analytical model and 992 data points for CH2NH2+. After the adaptive sampling, the training set increased to twice its initial size. Each nuclear configuration was given to the NNs as the matrix of inverted distances. For optimal performance, we tested several aspects of the NN architecture, like the type of non-linear basis function used (hyperbolic tangent and shifted softplus function, ln(0.5e^x + 0.5)), the number of neurons, the number of hidden layers, as well as other parameters, such as the learning rate (lr in the following equations), the L2 regularization rate, the number of epochs, the batch size, and a constant factor (η in equation 2) that regulates the influence of forces in the update step. Emphasis was put on the optimization of the L2 regularization rate with respect to a given batch size and the learning rate.
For each quantity that has to be predicted, a separate NN was used, except for the gradients.
The latter enter directly into the NNs dedicated to the potential energy prediction. The NNs for the analytical model consisted of one or two hidden layers, whereas deep NNs with 6 hidden layers were applied for training and predictions of properties of CH2NH2+. We chose the NNs with respect to their error on the validation set as well as the form of their loss function. For the initial training set, we always sampled two different variables of the NN architectures. The learning rate was first scanned during a couple of iterations from values of 10^-7 to 10^0 along with the L2 regularization rate (values between 10^-14 and 10^-4) using a fixed number of neurons (50) and hidden layers (4).
After getting an idea about the magnitude of the L2 regularization rate, the decay factor as well as the step number for annealing the learning rate were evaluated. The decay factor was sampled from 0 to 1, with values between 0.9 and 1.0 giving the best NN training. The number of epochs after which the learning rate should be annealed was sampled from 1 to 1000, with values between 1 and 100 leading to best results. After this, the type of basis function was analyzed. The hyperbolic tangent turned out to give fair results for nonadiabatic couplings (NACs) and dipole moments, whereas the shifted softplus function was slightly better for energies and gradients. We further tried batch sizes of 1, 2, 5, 10, 32, 64, 100, 128, 256, 500, and 992 for the initial training set (batch sizes of 2000 and 4000 were also tested for the larger and final training sets) along with the learning rate. A batch size of 500 turned out to lead to good training performance. The number of hidden layers was evaluated from 1 up to 8 hidden layers, along with the number of neurons from 10 to 100. For energies, gradients, and dipole moments the number of neurons was set to 50, and for networks trained on NACs a slightly larger number of neurons was chosen. After assessing those parameters, the learning rate as well as the L2 regularization rate were sampled again, now in a narrower space relying upon the previously defined values. The influence of the decay factor and update step for annealing the learning rate was further reassessed. Four hidden layers were used for training set sizes up to around 2000 data points; afterwards, 6 hidden layers slightly improved the NN accuracy.
Each time new parameters were sampled, we trained around 20-100 networks, depending on the space to cover for a given parameter. The large initially sampled space of (hyper)parameters was narrowed after the initial training set was expanded, and the narrowed space was assumed to be a good starting point for the random grid search. There were several sets of parameters that led to very similar performance, and the two with the lowest error on the validation and training sets were chosen for carrying out dynamics simulations.
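The random grid search described above can be sketched as follows. This is a minimal illustration, not the original implementation: the parameter names and the exact sampling distributions (log-uniform for rates, uniform for counts) are assumptions based on the ranges stated in the text.

```python
import random

# Hypothetical search space mirroring the ranges described in the text;
# names and sampling scheme are illustrative, not the original code.
SEARCH_SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-7, 0),
    "l2_rate": lambda: 10 ** random.uniform(-14, -4),
    "decay_factor": lambda: random.uniform(0.9, 1.0),
    "anneal_steps": lambda: random.randint(1, 100),
    "batch_size": lambda: random.choice([1, 2, 5, 10, 32, 64, 100, 128, 256, 500]),
    "hidden_layers": lambda: random.randint(1, 8),
    "neurons": lambda: random.randint(10, 100),
}

def sample_architectures(n_models, seed=0):
    """Draw n_models random hyperparameter sets from the search space."""
    random.seed(seed)
    return [{name: draw() for name, draw in SEARCH_SPACE.items()}
            for _ in range(n_models)]
```

In such a scheme, each sampled set would be used to train one network (here 20-100 per round), and the search space would then be narrowed around the best-performing sets.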

S1.2 NN training
In general, training aims at fitting the weight parameters of a NN in order to minimize a cost function, which usually represents the mean-squared error (MSE) between predictions by the NNs and reference data. For this task, we used Adam (adaptive moment estimation),5 a stochastic gradient descent optimization algorithm. In combination with Adam, an exponential learning rate decay was applied.3 Therefore, a decay factor, f_lr, with values between 0 and 1, as well as a step size were defined. The step size gives the number of epochs that have to be passed before lr is adjusted according to the following equation:

lr_new = f_lr · lr    (1)

Quantities which can be related to a nuclear configuration and predicted by the implemented NNs are energies as well as corresponding gradients, spin-orbit couplings, NACs, and dipole moments. For predictions of energies, it is favorable to include gradients in the minimization process.6 The cost function thus compares, in addition to the energies, the forces, once derived from the NNs, F^NN_mα, and once calculated with quantum chemistry, F^QC_mα:

Cost = MSE(E^QC, E^NN) + η · MSE(F^QC_mα, F^NN_mα)    (2)

The constant η regulates the influence of the forces on the overall cost function and is set to 1 here. All different electronic states were treated within one NN; thus the errors entering the cost function are accumulated over all electronic states.
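The learning-rate schedule and the force-including cost function can be sketched as follows. This is a hedged reading of the text: the closed-form decay (multiply by f_lr once per completed step) and the normalization of the force term are assumptions, not the original theano code.

```python
import numpy as np

def annealed_lr(lr0, f_lr, step, epoch):
    """Exponential learning-rate decay (equation 1 sketch): lr is
    multiplied by the decay factor f_lr each time `step` epochs have
    passed. The closed form is an assumed reading of the schedule."""
    return lr0 * f_lr ** (epoch // step)

def cost(E_nn, E_qc, F_nn, F_qc, eta=1.0):
    """Cost function sketch (equation 2): MSE of the energies plus an
    eta-weighted MSE of the forces; the exact normalization of the force
    term in the original implementation is assumed."""
    E_nn, E_qc = np.asarray(E_nn), np.asarray(E_qc)
    F_nn, F_qc = np.asarray(F_nn), np.asarray(F_qc)
    return np.mean((E_nn - E_qc) ** 2) + eta * np.mean((F_nn - F_qc) ** 2)
```

With f_lr = 0.95 and a step size of 62 (the values reported for the NN/poly.D. energy model in Table S3), the learning rate would drop by 5% every 62 epochs.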
The reference data set was always split into a training and a validation set in a random fashion using a ratio of 9:1. For training and prediction, inputs and outputs, X, were scaled with respect to the mean, μ, and standard deviation, σ, of the training set according to equation 3:

X_scaled = (X − μ) / σ    (3)
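The scaling of equation 3 amounts to standardization with training-set statistics; a minimal numpy sketch (function name is illustrative):

```python
import numpy as np

def standardize(X_train, X):
    """Scale inputs/outputs X with the mean and standard deviation of
    the training set (equation 3); a minimal sketch."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X - mu) / sigma
```

Applied to the training set itself, the scaled features have zero mean and unit variance; validation and test data are scaled with the same μ and σ to avoid information leakage.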
An early stopping mechanism was used to control overfitting. Additionally, the convergence of the loss functions of the training and validation set was checked manually, ensuring, e.g., that their orders of magnitude are similar. The training set of the analytical model consisted of 100 equidistant data points containing values of x between 0.4 and 1.8 (with x being the one degree of freedom of the harmonic oscillators, see Table S7). Two initial training sets were generated for CH2NH2+: one via sampling along each normal mode as well as a torsional mode, and another one by executing dynamics simulations with SHARC but starting each trajectory from the same geometry, i.e., the equilibrium geometry, with velocities sampled according to a Wigner distribution.7 The excitation window for these simulations was set to 7-11 eV. The training set sampled along normal modes and the torsional mode yielded a better initial fit of the different computed properties, thus the focus was set only on this set for CH2NH2+. The initial data set was insufficient for excited-state dynamics simulations and was expanded via an adaptive selection scheme as described in reference 6. For this purpose, two slightly different sets of NNs (NN1 and NN2, see Table S1 for the differing parameters and Table S2 for the mean absolute error (MAE) obtained with those NNs) were chosen to start dynamics simulations, replacing the quantum chemical (QC) calculations; conformations for which the predictions of the two sets of NNs diverge are recalculated with QC and added to the training set. This is done for each property independently, and as soon as one property is predicted unreliably, a new QC reference calculation is carried out. The NNs are then retrained, and previous weights are not used as a guess for training the NNs. For the dipole moments, the threshold was set to 0.5 a.u. (atomic units). The latter was kept constant, since dipole moments do not directly enter into the dynamics in our current approach but are implemented for future research purposes.
For the RMSE of the NACs, the threshold is initially set to 0.25 a.u., reduced adaptively, and raised intermediately to 3.0 a.u. after a training set size of 3600 points is reached. This intermediate raise makes it possible to focus only on geometries close to conical intersections, where the NAC is very large. These are necessary to describe the dynamics accurately. Consequently, the population dynamics converges towards the correct behavior, as can be seen in Fig. S1 for the last iterations of the sampling procedure.
As already mentioned, the predefined threshold is multiplied by a factor of 0.95 after a new conformation is detected, recalculated with QC, and the NNs are retrained. The initial value was adapted by this factor after each training cycle for adaptive sampling, while the training set was still being built. For the long dynamics simulations (after 10 ps), two NNs are still used for detecting undersampled regions of conformational space, but the threshold for the RMSE between the NNs is now kept constant (at a value of 0.05 a.u. for energies). The training set size increases only slightly (by 58 samples) during this time; these configurations are not considered to lie within the previously visited regions of conformational space.
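The adaptive-sampling criterion described in this subsection can be sketched as follows. Function names are illustrative assumptions; the sketch only captures the two ingredients stated in the text: flagging a geometry when the RMSE between the two NN sets exceeds a property-specific threshold, and tightening that threshold by a factor of 0.95 after each detection.

```python
import numpy as np

def needs_qc(pred_nn1, pred_nn2, threshold):
    """Adaptive-sampling criterion (sketch): a geometry is flagged for
    a new QC reference calculation when the RMSE between the predictions
    of the two NN sets exceeds the property-specific threshold."""
    diff = np.asarray(pred_nn1) - np.asarray(pred_nn2)
    return float(np.sqrt(np.mean(diff ** 2))) > threshold

def tighten(threshold, factor=0.95):
    """Multiply the threshold by 0.95 after each newly detected
    conformation, as described above."""
    return threshold * factor
```

During the long 10 ps simulations, `tighten` would simply no longer be called, keeping the threshold constant (e.g. 0.05 a.u. for energies).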

S1.3 Performance of NNs
In this subsection, we want to assess the performance of the generated NN potentials by comparing to different ML models and by an in-depth analysis of the predictions specific to each electronic state. We evaluate the accuracy by using ML models trained on the training set of the methylenimmonium cation computed with the QC1 method (see section S4 below). We want to point out that the QC reference dynamics simulations are not included in the training set; thus, reproducing the dynamics can be considered as an indirect test. However, we cannot use the points from the QC reference dynamics for a direct comparison in a straightforward manner, since these points are not phase corrected. In order to evaluate the accuracy of the NN potentials, we generated an out-of-sample test set that contains energies and forces as well as NACs (and dipole moments for completeness) of 770 data points, additionally generated via sampling of linear combinations of different normal modes of CH2NH2+. This test set is phase corrected and is made available with the training set (see file training+test_set_CH2NH2+.zip, in SHARC format). We compare our NN model with another NN model using the Coulomb matrix8 (referred to as NN/Coulomb in the following) with 21 features (inputs) instead of the inverse distance matrix (NN/inv.D. in the following) with 15 features as a molecular descriptor. We further generate cross-products of all entries of the inverse distance matrix and use the resulting vector as descriptor (referred to as NN/poly.D., since polynomials of the inverse distances are contained). The dimension of poly.D. is thus squared compared to inv.D.; hence a more accurate description and higher accuracy could be expected. To compare between different ML models, we analyze the performance of two additional regression models: linear regression (LR), computed as a baseline model, and support vector machine for regression (SVR), used for additional comparison.
In both cases, the same molecular descriptor (inv.D.) as for the reference NNs is taken.
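The inv.D. and poly.D. descriptors discussed above can be sketched in a few lines of numpy. This is an illustrative reconstruction under stated assumptions: the inv.D. features are taken to be the unique off-diagonal entries of the inverse-distance matrix (15 for the 6 atoms of CH2NH2+), and poly.D. is taken to be all pairwise products of those entries.

```python
import numpy as np

def inv_distance_descriptor(coords):
    """inv.D. sketch: the unique off-diagonal entries of the
    inverse-distance matrix (15 features for a 6-atom molecule)."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)
    return 1.0 / np.linalg.norm(coords[i] - coords[j], axis=1)

def poly_descriptor(inv_d):
    """poly.D. sketch: cross-products of all inverse-distance entries,
    squaring the descriptor dimension as described above."""
    inv_d = np.asarray(inv_d, dtype=float)
    return np.outer(inv_d, inv_d).ravel()
```

For a 6-atom geometry this yields 15 inv.D. features and 225 poly.D. features, matching the dimension argument made in the text.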

S1.3.1 Computational details for the ML models
For training the NN/Coulomb, LR, and SVR models, we use the training set containing 4000 data points. We proceed in the same way as for the NN/inv.D. model described above and scale inputs and outputs for training and prediction according to equation 3. In the same manner as for the NN/inv.D. model, we sample different hyperparameters of the network architecture for the NN/Coulomb model and NN/poly.D. model (see discussion in section S1.1), but as a starting point we take the optimal hyperparameters already found for the NN/inv.D. model (see Table S1). Since no better performance is obtained using different hyperparameters for the NN/Coulomb model, we keep them unchanged and only change the molecular descriptor from the matrix of inverse distances to the Coulomb matrix8, C, which additionally includes the atomic charges in the representation. The diagonal elements, C_ii, which are absent in the inv.D. descriptor, are constant values defined as

C_ii = 0.5 · Z_i^2.4    (6)

and the off-diagonal elements, C_ij, are computed as

C_ij = Z_i · Z_j / | R_i − R_j |    (7)

Z_i in equations 6 and 7 refers to the nuclear charge of atom i and R_i is the position of atom i. All electronic states are treated within one NN. Again, forces are trained together with potential energies and predicted as their derivatives. A separate NN is used for training of NACs. In the case of the NN/poly.D. model, a set of slightly different hyperparameters is chosen from the hyperparameter search for energies and gradients and given in Table S3. The hyperparameters for nonadiabatic couplings are kept unchanged from the NN/inv.D. models.
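Equations 6 and 7 (the standard Coulomb matrix of Rupp et al., cited as reference 8) can be implemented directly; a minimal sketch:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix (equations 6 and 7):
    C_ii = 0.5 * Z_i**2.4, C_ij = Z_i * Z_j / |R_i - R_j|."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            if a == b:
                C[a, b] = 0.5 * Z[a] ** 2.4
            else:
                C[a, b] = Z[a] * Z[b] / np.linalg.norm(R[a] - R[b])
    return C
```

For the 6 atoms of CH2NH2+ this gives a symmetric 6 × 6 matrix, i.e. 21 unique entries, matching the 21 features quoted for the NN/Coulomb descriptor.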

Table S3
Selected parameters to construct NN/poly.D. models trained on 4000 data points of CH 2 NH + 2 . The parameter, η in equation 2, to control the influence of the forces for training of energies is set to 1. The parameters for NNs trained on NACs are depicted in Table S1.

Property: Energy
Number of hidden layers: 6
Number of neurons per hidden layer: 50
Batch size: 500
Learning rate, lr: 3.57 · 10^-3
Decay factor, f_lr: 0.950
Update steps for annealing of lr: 62
L2 regularization rate: 6.10 · 10^-8
Basis function: Shifted softplus
To carry out LR and SVR, scikit-learn9 is used. Each electronic state is treated separately, and the forces and NACs are trained independently from the energies. Ordinary least squares LR is used as implemented in scikit-learn. For SVR, we tested different kernel types, mainly the linear kernel, radial basis functions, and polynomial kernels. For final training and prediction, radial basis functions are used.
The penalty parameter of the error term is set to a value of 100, and the rest of the parameters are kept at their default values, because the performance does not change considerably when setting different parameters.

S1.3.2 Comparison of different ML models
First, we seek to evaluate the performance of our NNs by comparison to other ML models. In our experiments, we carry out dynamics simulations with 2 NNs. Therefore, we compute the MAE averaged over all states for energies, forces, and NACs on the test set of 770 data points with two models. For the NN/inv.D. models, NN1 and NN2 are used, as they are in the dynamics simulations. For the rest of the models, we split the data set of 4000 points into different training and validation sets using a ratio of 9:1 and use different initial weights. The results are given in Table S4 (for completeness, the MAE on dipole moments for our model is 0.168 a.u., but will not be discussed here). As can be seen in the table, the NNs with different descriptors reach similar accuracy. Especially the inverse distance matrix and the Coulomb matrix perform very similarly, as would be expected from such closely related descriptors. Due to this similarity, we keep the simplest of these descriptors (inv.D.) for further analysis. SVR is worse than the NN approaches, with an error for energies and gradients approximately twice as large as for the NNs. The SVR also delivers worse predictions of NACs compared to the NNs. LR, as a baseline model, has a MAE that is about a factor of 10 larger than that of SVR for energies and gradients and about 1.5 times larger for NACs, demonstrating the utility of our NN approach described above.

S1.3.3 State-specific analysis of the NNs
In order to clarify whether there is a bias of our ML model toward a specific state, we evaluate the errors of our reference NN/inv.D. model on energies and gradients on the same test set of 770 data points for each electronic state separately. As can be seen in Table S5, the MAE of the S2-state energies and gradients is approximately twice as large as that of the two lower states. Analysis of different reaction coordinates of the PES reveals problems within the QC calculations that become pronounced especially in critical regions of the PES and lead to erratic potential energy curves.
To show this problem, a scan along a reaction coordinate that includes two avoided potential energy curve crossings of the molecule is exemplified in Fig. S2. As can be seen, a jump in the S2 potential energy curve is obtained with the QC1 reference method. The source of this behaviour might be higher electronic states that enter along the reaction coordinate ("intruder states") or new electronic configurations of the molecule that were not included in the active space from the beginning. However, the NNs used in this work (panel A in Fig. S2) do not reproduce this discontinuity and predict smooth potential energy curves, certainly due to the use of gradients in the loss function for the training. As a consequence, the MAE obtained for the S2 state is artificially higher than for the rest of the states. For completeness, we carry out the same scan with the other ML models and, as expected, the potential energy curves obtained with NN/Coulomb (panel B in Fig. S2) are very similar to those of the NN/inv.D. model; the trend observed in Table S4 is also obtained for the other models. The SVR model (panel E) reproduces the potential energy curves slightly worse than the NNs, and LR is far off from the reference potential energy curves (panel D). An important feature of the ML models used here is their tendency towards a smooth interpolation of the potential energy surfaces (see Fig. S2). As a consequence, high artificial errors and discontinuities introduced by the reference method can be compensated to a certain extent.
Moreover, since all ensemble models predict the same trend in such situations (see e.g. NN1 and NN2 Fig. S2A), a low uncertainty is reported for the afflicted regions and no potentially detrimental data is added to the data set during the adaptive sampling. This feature is desired and advantageous since the ML models correct for the erratic behavior of QC1.
Nevertheless, such regions often coincide with critical regions of the PES and should be considered with care. To highlight such regions of conformational space that should be sampled more comprehensively in the training set, some other metrics could be considered. For example, very small energy gaps between different states or large coupling values are an indicator for a crossing point of two potentials. Additionally, the root-mean-squared displacement (RMSD) from a new configuration to all other configurations contained in the training set could be calculated, and a threshold could be set which tells us when a new geometry is dissimilar to the current training set and should be computed with QC and added to the training set. Another possibility would be to add more data points from trajectories where the gradient is large, which is done in Ref. 10, for example, for the generation of a database. This is important because, during dynamics simulations, the molecule will remain longer in regions where the forces are small, hence enlarging the training set mainly by data points with small gradients.10 Another procedure is used for Shepard interpolation.11 There, a statistical approach with no significant additional costs is used to determine the uncertainty of the potential energy by measuring its variance.

S2 Phase correction
We carried out a phase correction procedure to obtain a training set suitable for our ML algorithm. To this aim, a phase vector, p, had to be derived for each molecular geometry, containing values of +1 and -1 for each state. This vector was obtained by the computation of wave function overlaps between two different geometries. The overlap matrix, S, between the wave functions of subsequent nuclear configurations, Ψ_k and Ψ_l, was computed using the WFoverlap code:12

S_kl = ⟨ Ψ_k | Ψ_l ⟩    (8)

The overlap matrix has the dimension N_states × N_states, with N_states being the number of electronic states included in a calculation. When the molecular geometries are similar enough to each other, the overlaps are sufficiently large to gain information on whether a phase switch has occurred or not.
Wave function overlaps for an electronic state that are close to +1 indicate that no phase change has taken place, whereas overlaps close to -1 point out a change in the phase of a wave function.
Far from conical intersections, p usually corresponds to the diagonal matrix elements of S and also has the dimension N_states. Whenever two PESs are in close proximity, off-diagonal elements of S can be larger than the diagonal elements. In such cases, the values of p depend on the absolute values of the row elements of S. A threshold of ±0.5 was set for the overlaps between two subsequent wave functions to ensure that the correct value of +1 or -1 was taken into account. Each entry, p_i, of p was then calculated according to Equation 9, with i and j running over all N_states:

p_i = sgn( max_j(| S_ij |) · sgn(S_ij) )  ∀ | S_ij | ≥ 0.5;  i, j = 1, 2, ..., N_states    (9)

However, many situations arise in which the computed overlaps between two geometries yield values close to 0. With these, it is not possible to decide reliably whether a phase change is present or not. Therefore, an interpolation of n steps between those two molecular geometries needed to be carried out, with n usually being in the range of 5 to 10. The interpolation was performed with respect to a molecule's Z-matrix. Transformation from xyz format to Z-matrix format and vice versa was executed with OpenBabel.13 The size of the interpolation steps was sufficient when each computation yielded overlaps close to +1 or -1. The phase vector, p_n, applicable to the last point, n, of the required interpolation was found by element-wise multiplication of all previous phase vectors, p_0 to p_{n-1}:

p_n = ∏_{k=0}^{n-1} p_k    (10)

p_0 always corresponds to the phase vector between the reference geometry, upon which the complete training set is phase corrected, and the first interpolation step. The reference geometry for global phase correction was chosen to be the equilibrium geometry.
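Equations 9 and 10 can be sketched as follows. This is an illustrative reading, not the original code: a zero entry is used here to signal an inconclusive overlap (all |S_ij| below 0.5), i.e. the case in which interpolation between the two geometries is required.

```python
import numpy as np

def phase_vector(S, threshold=0.5):
    """Equation 9 sketch: for each state i, take the sign of the
    largest-magnitude row element of the overlap matrix S, provided it
    reaches the 0.5 threshold; an entry of 0 signals an inconclusive
    overlap that requires interpolation between the two geometries."""
    S = np.asarray(S, dtype=float)
    p = np.zeros(S.shape[0])
    for i, row in enumerate(S):
        j = int(np.argmax(np.abs(row)))
        if abs(row[j]) >= threshold:
            p[i] = np.sign(row[j])
    return p

def accumulate_phases(phase_vectors):
    """Equation 10 sketch: the element-wise product of the phase vectors
    of all interpolation steps gives the phase of the final geometry."""
    return np.prod(np.asarray(phase_vectors, dtype=float), axis=0)
```

For a nearly diagonal overlap matrix this reduces to the signs of the diagonal elements, as stated in the text for regions far from conical intersections.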
If no row element of S had an absolute value above or equal to ±0.5, the interpolation was stopped and more interpolated nuclear configurations between the equilibrium geometry and the one that needs to be included in the training set were generated. If there were still not enough 1-25 | 11 interpolated configurations to account for sufficiently large overlaps between two subsequent wave functions, the simulations were stopped. One reason for too small overlaps are "intruder states", which are excluded at the reference geometry, but are included at another geometry due to an energy drop. Thus, a previously included state is excluded such that the overall molecular electronic wave function changes considerably.
The phase correction of each matrix M (in our case, the Hamiltonian matrix, which contains energies and couplings, and the dipole matrices for each direction of the coordinate system) was carried out according to Equation 11:

M_corrected,ij = p_i · p_j · M_ij    (11)

Matrices containing all relevant nonadiabatic coupling vectors, which form the nonadiabatic coupling tensor with the dimension N_states × N_states × N_atoms × 3 (with N_atoms being the number of atoms), were corrected according to Equation 12. Each two-dimensional nonadiabatic coupling vector, NAC_ij, accounts for the NACs between two states i and j. It was corrected by multiplication of every element with the entries of p for states i and j:

NAC_corrected,ij = p_i · p_j · NAC_ij    (12)
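The corrections of Equations 11 and 12 can be sketched with numpy broadcasting; a minimal illustration (function names are assumptions):

```python
import numpy as np

def phase_correct_matrix(M, p):
    """Equation 11 sketch: each element M_ij of a property matrix (the
    Hamiltonian or a dipole matrix) is multiplied by p_i * p_j."""
    p = np.asarray(p, dtype=float)
    return np.asarray(M, dtype=float) * np.outer(p, p)

def phase_correct_nacs(nac, p):
    """Equation 12 sketch: each NAC vector NAC_ij of the coupling tensor
    of shape (N_states, N_states, N_atoms, 3) is multiplied by p_i * p_j."""
    p = np.asarray(p, dtype=float)
    return np.asarray(nac, dtype=float) * np.outer(p, p)[:, :, None, None]
```

Note that diagonal elements (energies, permanent dipoles) are unaffected, since p_i · p_i = 1 for p_i = ±1.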

S3 Analytical Model
In order to check the performance of our deep learning molecular dynamics approach, we constructed a one-dimensional, diabatic analytical model consisting of five harmonic oscillators, using the analytical interface of SHARC.14,15 The potential energy curves are defined in Table S6. Nonadiabatic potential couplings (in contrast to nonadiabatic derivative couplings) are responsible for transitions between the 1X and 1A state, while spin-orbit couplings couple states of different spin multiplicity, i.e., 1X and 3B as well as 1A and 3B. These couplings as well as dipole moments between different states are described by constant values given in Table S7; the errors achieved by the NNs for this model are given in Table S9. This first proof-of-concept thus shows the ability of our ML method to describe excited-state dynamics including nonadiabatic potential and spin-orbit couplings.

Table S7
Couplings between different electronic states of the analytical model.

S4.1 Computational details for QC
The quantum chemical reference calculations of the methylenimmonium cation, CH2NH2+, were done with the program COLUMBUS,16 which uses the implementation of nonadiabatic coupling vectors via Dalton17 integrals. CH2NH2+ possesses 16 electrons. Molecular orbitals were optimized with the state-averaged complete active space self-consistent field (SA-CASSCF) method, averaged over 3 states. The CAS was set to 6 active electrons in 4 active orbitals. The two lowest molecular orbitals, representing the 1s orbitals of the C and N atoms (Fig. S4A), were frozen at the MRCI level of theory. The next three energetically lowest orbitals were set to be inactive (Fig. S4B) and 4 orbitals were assigned to the reference space with 6 active electrons (Fig. S4C). Single and double excitations from the (6,4) reference space to 73 virtual orbitals yield a total of about 660,000 configuration state functions. Single and double excitations from the inactive orbitals to the virtual orbitals are also included within MRCI-SD. The frequency calculation was carried out with COLUMBUS. We performed scans with 100 points along each normal mode (Table S10) as well as with 72 points along the torsion around the central double bond in order to generate the initial training set. From these scans, we removed data points from computations that did not show proper convergence. After adaptive sampling, we ended up with a training set of 4000 data points.

S4.2 Surface-hopping molecular dynamics
As a prerequisite for the dynamics, 1000 initial conditions were sampled from the Wigner distribution of the quantum harmonic oscillator defined by the above-mentioned frequency calculation. From these 1000 possible starting points, 200 were excited, according to the oscillator strengths in the excitation window of 9.44 ± 0.15 eV, to the brightest state, which is the second excited singlet state. This excitation shows mostly ππ* character. Each trajectory was propagated for 100 fs with a time step of 0.5 fs for the nuclear motion and 0.02 fs for the integration of the time-dependent Schrödinger equation. Trajectories showing problems with energy conservation due to improper convergence of the quantum chemistry were excluded, resulting in 90 trajectories for QC1 and 88 trajectories for QC2. However, the trend of the populations when all trajectories are taken into account is the same. Therefore, no bias was introduced by sorting out trajectories due to bad energy convergence within the quantum chemical calculations, which appeared mainly around conical intersections. Such problems do not occur in the NNs, which demonstrates another advantage of the ML approach. The rate constants discussed in the main text are obtained using the tools of SHARC.14,18 By solving differential equations that describe the kinetics of the underlying model, a fit for the population transfer can be obtained and rate constants can be derived. In addition to the population dynamics, we also checked whether the visited molecular geometries are comparable along the trajectories. We first calculated the mean nuclear configuration over time from the NN trajectories underlying the populations given in Fig. S5, and computed the RMSD to the mean molecular geometry at the respective time step of the ensemble predicted with MR-CISD/aug-cc-pVDZ, i.e., QC1 (Fig. S5B, continuous line). An analogous comparison is also carried out for QC1 and QC2 (MR-CISD/6-31++G**, Fig. S5B, dotted line).
For completeness, we also computed the RMSD between NN and QC2 (Fig. S5B, dashed line). All RMSDs are of comparable size, further validating our NN approach. Furthermore, we checked the energy conservation of trajectories with the reference method QC1 and the NNs independently. For both methods, we computed the mean and standard deviation (Std.) of the total energy along a trajectory, where we once allow hopping and once force the trajectories to stay in the S2 state after excitation. Results are shown in Table S11. As can be seen, the mean total energy as well as the Std. along trajectories propagated for 100 fs are comparable between the methods. However, the Std. increases slightly when hops between different energetic states are forbidden. This trend is obtained with both methods. Moreover, when a time step of 0.05 fs is used instead of 0.5 fs for the classical propagation of the nuclei, the energy conservation improves, which meets expectations. Worth mentioning is that, in the case of QC1, trajectories which show large steps in total energy or do not reach 100 fs at all are removed prior to the analysis. This is not the case for the NNs, as all trajectories were suitable for analysis and no prior selection was carried out.
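The RMSD comparison above can be sketched as follows, assuming geometries with matching atom order and no rotational/translational alignment (the exact alignment convention used in the analysis is not stated in the text):

```python
import numpy as np

def rmsd(geom_a, geom_b):
    """RMSD between two molecular geometries of shape (N_atoms, 3);
    a simplified sketch with no alignment applied."""
    geom_a = np.asarray(geom_a, dtype=float)
    geom_b = np.asarray(geom_b, dtype=float)
    return float(np.sqrt(np.mean(np.sum((geom_a - geom_b) ** 2, axis=1))))
```

Applied per time step to the ensemble-mean geometries of two methods, this yields RMSD-versus-time curves like those in Fig. S5B.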

S4.3 Conical intersections (CIs)
With each method, QC1, QC2, and NNs, we optimized the geometries of the two S 1 /S 0 and S 2 /S 1 CIs. Optimizations were executed with the SHARC tools that utilize an external optimizer of ORCA, 19 where the computed energies and gradients 20,21 from COLUMBUS or NNs are fed in.
Geometries are shown in Fig. S6. They were optimized starting from the different hopping geometries obtained with the afore-mentioned methods, as indicated in the scatter plot of Fig. 6 in the main text and in Fig. S7. The optimized molecular geometries agree well. Cartesian coordinates are given in Tables S12-S15 and are further added as ascii-files to the supplementary material.

Table S11 MAE and standard deviation (Std.) of the total energy obtained with the QC1 (MR-CISD/aug-cc-pVDZ) method and neural networks (NNs). In the case of a time step of 0.5 fs for classical propagation of the nuclei, 90 trajectories were analyzed, whereas 50 trajectories were used for the analysis of trajectories with a time step of 0.05 fs.

Fig. S6 Conical intersections (CIs).
For each method, we could find two different CIs. QC1 (MR-CISD/aug-cc-pVDZ) is used as the reference method to train the neural networks (NNs) and is compared to the CIs obtained with QC2 (MR-CISD/6-31++G**). For the two CIs, the bond length between the nitrogen (blue) and the carbon (grey) atom is shown, as well as the bond length between the carbon and one hydrogen atom. Values are given in angstrom. The dihedral angle between the four atoms marked with the dashed line is given, as well as an angle between the carbon and a hydrogen atom (S1/S0 CI) and between two hydrogen atoms (S2/S1 CI).
The S1/S0 CI shows a rotation of the molecule around the H3-N-C-H4 dihedral angle (see Fig. S7 for the definition) of ∼90°. Due to the symmetry of the molecule, the CI is obtained at ∼90° as well as at ∼−90°. The optimized geometries agree well between all methods (QC1, QC2, NN).
All methods yield hopping geometries that are distributed in a rather large region around these points. The S2/S1 CI is defined by a slight bi-pyramidalization and an elongation of the bond between the carbon and the nitrogen atom to around 1.44 Å (see Fig. S6), compared to a bond length of around 1.29 Å at the equilibrium geometry (Fig. S7B). Again, the molecular geometries obtained with the different methods are very similar. The scatter plot in the right panel of Fig. S7A shows the good agreement among the different methods and also depicts the distribution of data points in the final training set.

Fig. S7 (B) Shown is the equilibrium geometry of the methylenimmonium cation in order to specify the atoms that define the dihedral and pyramidalization angles used in Fig. 5 and Fig. 6 in the main text and in Fig. S6 here. The pyramidalization angle that is used to describe the S1/S0 CI is defined as the angle between the N-C bond (red line) and the plane spanned by the C atom and the two hydrogen atoms H2 and H4 (red triangle). The dihedral angle between the planes spanned by atoms H3-N-C and N-C-H4, chosen to distinguish geometries describing both CIs, is defined by the hydrogen atoms H3 and H4 as well as by the C and N atoms (dashed black line).
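A dihedral angle such as H3-N-C-H4 is computed from four atomic positions in the standard way; a small self-contained sketch, independent of any particular code used in this work:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond, i.e. the
    angle between the planes p0-p1-p2 and p1-p2-p3. For the S1/S0 CI
    one would pass the positions of H3, N, C, and H4."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    n0 = np.cross(b0, b1)              # normal of the first plane
    n1 = np.cross(b1, b2)              # normal of the second plane
    m = np.cross(n0, b1 / np.linalg.norm(b1))
    x = np.dot(n0, n1)
    y = np.dot(m, n1)
    return np.degrees(np.arctan2(y, x))
```

For a planar cis arrangement this returns 0°, for trans 180°, and for a perpendicular arrangement, such as the twisted S1/S0 CI geometry, a magnitude of 90°.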

S4.4 Movie S1
A movie of a molecular dynamics trajectory of the methylenimmonium cation, CH2NH2+, during the first 100 fs after its excitation to the second excited singlet state, predicted with deep NNs in combination with SHARC, can be found in the supplementary material, along with the potentials on which the molecule moves at each time step.

S4.5 Movie S2
A movie of a molecular dynamics trajectory of the methylenimmonium cation, CH2NH2+, during 10 ps after its excitation to the second excited singlet state, predicted with deep NNs in combination with SHARC, can be found in the supplementary material.

S5 Future perspectives
As an outlook, we expect our approach to offer new possibilities for studying the excited-state dynamics of larger molecules on longer timescales. By using the adaptive sampling procedure in combination with NN dynamics simulations, an initial training set can be automatically expanded until the conformational space of a molecule that is important for its dynamics is sampled comprehensively. In the following, a short discussion of the expected scaling of the methods and of possibilities to treat larger molecules is given.
As a good starting point for the initial training set, we suggest sampling along normal modes and linear combinations of different normal modes. Alternatively, the points obtained from a geometry optimization of critical points such as CIs might deliver an even better initial set. As a guideline, we suggest from our limited experience that approximately 1000 data points should be considered for the initial training set. With this procedure, some critical regions of the potential energy surface close to the equilibrium geometry can already be covered by the training set. Additionally sampling reaction coordinates that are considered important during excited-state molecular dynamics simulations is advisable; as an example, a reaction coordinate that involves the dissociation of an atom can be included. Depending on the size and complexity of the molecule as well as on the degree of initial sampling, the trajectories will be interrupted more frequently in the initial stages of an adaptive sampling run. In terms of training set sizes in general, we do not expect linear scaling with the number of atoms. The number of necessary data points depends not only on the number of atoms of a molecule, but also on its excited-state dynamics, dictated for example by its flexibility or the excitation energy. The more processes that can take place after excitation of a molecule, the larger the space that needs to be covered. Further, the number of states included in the dynamics simulations as well as the quantum chemical reference method play a role and should be considered. In general, no rule can be given for the optimal training set size, but the convergence behavior of the adaptive sampling runs can serve as an indicator of whether additional reference data are required.
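A normal-mode sampling step of the kind suggested above might look as follows. This is a simplified sketch: the uniform random coefficients and the per-mode amplitude scaling are illustrative choices, and mass-weighting details are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def normal_mode_samples(eq_geom, modes, amplitudes, n_samples):
    """Generate initial training-set geometries by displacing the
    equilibrium geometry along random linear combinations of normal modes.

    eq_geom:    (n_atoms, 3) equilibrium geometry
    modes:      (n_modes, n_atoms, 3) normal-mode displacement vectors
    amplitudes: (n_modes,) maximum displacement per mode (illustrative)
    Returns an array of shape (n_samples, n_atoms, 3)."""
    geoms = []
    for _ in range(n_samples):
        # one random coefficient per mode, scaled by the mode's amplitude
        coeffs = rng.uniform(-1.0, 1.0, size=len(modes)) * amplitudes
        disp = np.tensordot(coeffs, modes, axes=1)  # (n_atoms, 3)
        geoms.append(eq_geom + disp)
    return np.array(geoms)
```

Because linear combinations of modes are sampled, the resulting geometries also cover regions between the individual normal-mode directions.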
Finally, we want to point out that the adaptive sampling scheme, as soon as enough data points for a machine learning model can be obtained within a reasonable time frame, seems to be the most efficient way of sampling the relevant space visited during a dynamics simulation. We further recommend taking special care when selecting the number of electronic states. As already discussed in the main text in the section on the phase correction algorithm, it is important to compute more electronic states for molecules that have many electronic states lying close in energy. However, those additional states should only be used for phase correction and for the computation of wavefunction overlaps and do not need to be considered for couplings or for the computation of forces. With this approach, problems arising from "intruder states" during the phase correction procedure can be avoided. Regarding the network architecture, we recommend adapting it to the complexity of the molecule and expect approximately linear scaling of both the size of the network and its computational cost with the number of atoms of a molecule. Last but not least, we want to comment on non-molecule-specific excited-state potentials, similar in spirit to the universal force fields that are in part already available for the ground state.
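The core criterion of the adaptive sampling scheme referred to above is the disagreement between independently trained NNs: a geometry whose predictions diverge is sent back to the quantum chemical reference. A minimal sketch of this decision (the threshold value and names are illustrative):

```python
import numpy as np

def needs_recomputation(pred_a, pred_b, threshold):
    """Adaptive-sampling criterion: flag a geometry for quantum chemical
    recomputation (and inclusion in the training set) when two
    independently trained NNs disagree by more than a threshold.

    pred_a, pred_b: property predictions, e.g. energies of all states
    threshold:      maximum tolerated absolute deviation"""
    dev = np.max(np.abs(np.asarray(pred_a) - np.asarray(pred_b)))
    return bool(dev > threshold)
```

In a dynamics run, a trajectory would be interrupted at the first geometry for which this check fires; the recomputed point is then added to the training set and the NNs are retrained.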

We expect this to be a very complex task that has to be approached with great care. If molecules show similar excited-state dynamics (for example, the methylenimmonium cation and the ethylene molecule22 are isoelectronic and both show a rotation along the dihedral angle for the S1/S0 transition), a non-molecule-specific model is conceivable for describing this transition. However, the order of the higher-lying states differs between these two isoelectronic species, and higher excitation energies will induce completely different dynamics in the respective systems. Therefore, and also because generalization is problematic and couplings and excited-state potentials are molecule-specific, we do not expect a non-molecule-specific model to become available in the near future.