Andrew J.
Ballard
a,
Ritankar
Das
a,
Stefano
Martiniani
a,
Dhagash
Mehta
b,
Levent
Sagun
c,
Jacob D.
Stevenson
d and
David J.
Wales
*a
aUniversity Chemical Laboratories, Lensfield Road, Cambridge CB2 1EW, UK. E-mail: dw34@cam.ac.uk
bDepartment of Applied and Computational Mathematics and Statistics, University of Notre Dame, IN, USA
cMathematics Department, Courant Institute, New York University, NY, USA
dMicrosoft Research Ltd, 21 Station Road, Cambridge, CB1 2FB, UK
First published on 3rd April 2017
Machine learning techniques are being increasingly used as flexible non-linear fitting and prediction tools in the physical sciences. Fitting functions that exhibit multiple solutions as local minima can be analysed in terms of the corresponding machine learning landscape. Methods to explore and visualise molecular potential energy landscapes can be applied to these machine learning landscapes to gain new insight into the solution space involved in training and the nature of the corresponding predictions. In particular, we can define quantities analogous to molecular structure, thermodynamics, and kinetics, and relate these emergent properties to the structure of the underlying landscape. This Perspective aims to describe these analogies with examples from recent applications, and suggest avenues for new interdisciplinary research.
Understanding how emergent observable properties of molecules and condensed matter are encoded in the underlying potential energy surface is a key motivation in developing the theoretical and computational aspects of energy landscape research. The fundamental insight that results has helped to unify our understanding of how structure-seeking systems, such as ‘magic number’ clusters, functional biomolecules, crystals and self-assembling structures, differ from amorphous materials and landscapes that exhibit broken ergodicity.3–6 Rational design of new molecules and materials can exploit this insight, for example associating landscapes for self-assembly with a well defined free energy minimum that is kinetically accessible over a wide range of temperature. This structure, where there are no competing morphologies separated from the target by high barriers, corresponds to an ‘unfrustrated’ landscape.2,5,7,8 In contrast, designs for multifunctional systems, including molecules with the capacity to act as switches, correspond to multifunnel landscapes.9,10
In this Perspective we illustrate how the principles and tools of the potential energy landscape approach can be applied to machine learning (ML) landscapes. Some initial results are presented, which indicate how this view may yield new insight into ML training and prediction in the future. We hope our results will be relevant for both the energy landscapes and machine learning communities. In particular, it is of fundamental interest to compare these ML landscapes to those that arise for molecules and condensed matter. The ML landscape provides both a means to visualise and interpret the cost function solution space and a computational framework for quantitative comparison of solutions.
Construction of a kinetic transition network3,12,13 also provides a convenient means to visualise a high-dimensional surface. Disconnectivity graphs4,18 represent the landscape in terms of local minima and connected transition states, reflecting the barriers and topology through basin connectivity. The overall structure of the disconnectivity graph can provide immediate insight into observable properties:4 a single-funnelled landscape typically represents a structure-seeking system that equilibrates rapidly, whereas multiple funnels indicate competing structures or morphologies, which may be manifested as phase transitions and even glassy phenomenology. Locating the global minimum is typically much easier for single funnel landscapes.19
The decomposition of a surface into minima and transition states is quite general and can naturally be applied to systems that do not correspond to an underlying molecular model. In particular, we can use this strategy for machine learning applications, where training a model amounts to minimisation of a cost function with respect to a set of parameters. In the language of energy landscapes, the machine learning cost function plays the role of energy, and the model parameters are the ‘coordinates’ of the landscape. The minimised structures represent the optimised model parameters for different training iterations. The transition states are the index one saddle points of the landscape.20
Energy landscape methods2 could be particularly beneficial to the ML community, where non-convex optimisation has sometimes been viewed as less appealing, despite supporting richer models with superior scalability.21 The techniques described below could provide a useful computational framework for exploring and visualising ML landscapes, and at the very least, an alternative view to non-convex optimisation. The first steps in this direction have recently been reported.22–24 The results may prove to be useful for various applications of machine learning in the physical sciences. Examples include fitting potential energy surfaces, where neural networks have been used extensively25–32 for at least 20 years.33–35 Recent work includes prediction of binding affinities in protein–ligand complexes,36 applications to the design of novel materials,37,38 and refinement of transition states39 using support vector machines.40
In the present contribution we illustrate the use of techniques from the energy landscapes field to several ML examples, including non-linear regression, and neural network classification. When surveying the cost function landscapes, we employed the same techniques and algorithms as for the molecular and condensed matter systems of interest in the physical sciences: specifically, local and global minima were obtained with the basin-hopping method41–43 using a customised LBFGS minimisation algorithm.44 Transition state searches employed the doubly-nudged45,46 elastic band47,48 approach and hybrid eigenvector-following.49,50 These methods are all well established, and will not be described in detail here. We used the python-based energy landscape explorer pele,51 with a customised interface for ML systems, along with the GMIN,52 OPTIM,53 and PATHSAMPLE54 programs, available for download under the Gnu General Public Licence.
The test system considered here is a simple triatomic cluster with four distinguishable local minima that represent the possible outcomes for local minimisation. This system has previously served as a benchmark for visualising the performance of different geometry optimisation approaches.59–61 The potential energy, V, is a sum of pairwise terms, corresponding to the Lennard-Jones form,62 and the three-body Axilrod–Teller function,63 which represents an instantaneous induced dipole–induced dipole interaction:
(1) |
The objective of our machine learning calculations for this system is a classification, to predict which of the four minima a local minimisation would converge to, given data for one or more configurations. The data in question could be combinations of the energy, gradient, and geometrical parameters for the structures in the optimisation sequence. Our initial tests, which are mostly concerned with the structure of the ML solution landscape, employed the three interparticle separations r12, r13 and r23 as data.23 Inputs were considered for separations corresponding to the initial geometry, and for one or more configurations in the minimisation sequence.
A database of 10000 energy minimisations, each initiated from different atomic coordinates distributed in a cube of side length , was created using the customised LBFGS routine in our OPTIM program53 (LBFGS is a limited memory quasi-Newton Broyden,64 Fletcher,65 Goldfarb,66 Shanno,67 scheme). Each minimisation was converged until the root mean square gradient fell below 10−6 reduced units, and the outcome (one of the four minima) was recorded.
The neural networks used in the applications discussed below all have three layers, corresponding to input, output, and hidden nodes.68 A single hidden layer has been used successfully in a variety of previous applications.69 A bias was added to the sum of weights used in the activation function for each hidden node, wbhj, and each output node, wboi, as illustrated in Fig. 1. For inputs we consider X = {x1,…,xNdata}, where Ndata is the number of minimisation sequences in the training or test set, each of which has dimension Nin, so that . For this example there are four outputs corresponding to the four possible local minima (as in Fig. 1), denoted by yNNi, with
(2) |
The four outputs were converted into softmax probabilities as
(3) |
In each training phase we minimise the cost (objective) function, ENN(W;X), with respect to the variables w(1)ij, w(2)jk, wbhj and wboi, which are collected into the vector of weights W. An L2 regularisation term is included in ENN(W;X), corresponding to a sum of squares of the independent variables, multiplied by a constant coefficient λ, which is chosen in advance and fixed. This term is intended to prevent overfitting; it disfavours large values for individual variables. We have considered applications where the regularisation is applied to all the variables, and compared the results with a sum that excludes the bias weights. Regularising over all the variables has the advantage of eliminating the zero Hessian eigenvalue that otherwise results from an additive shift in all the wboi. Such zero eigenvalues are a consequence of continuous symmetries in the cost function (Noether's theorem). For molecular systems such modes commonly arise from overall translation and rotation, and are routinely dealt with by eigenvalue shifting or projection using the known analytical forms for the eigenvectors.2,70 Larger values of the parameter λ simplify the landscape, reducing the number of minima. This result corresponds directly with the effect of compression for a molecular system,71 which has been exploited to accelerate global optimisation. A related hypersurface deformation approach has been used to treat graph partitioning problems.72
For each LBFGS geometry optimisation sequence, d, with Nin inputs collected into data item xd, we know the actual outcome or class label, c(d) = 0, 1, 2 or 3, corresponding to the local minimum at convergence. The networks were trained using either 500 or 5000 of the LBFGS sequences, chosen at random with no overlap, by minimising
(4) |
In the testing phase the variables W are fixed for a particular local minimum of ENN(W;Xtrain) obtained with the training data, and we evaluate ENN(W;Xtest) for 500 or 5000 of the minimisation sequences outside the training set. The results did not change significantly between the larger and smaller training and testing sets.
We first summarise some results for ML landscapes corresponding to input data for the three interparticle distances at each initial random starting point. The number of local minima obtained23 was 162, 2559, 4752 and 19045 for three, four, five and six hidden nodes, respectively, with 1504, 10112, 18779 and 34052 transition states. The four disconnectivity graphs are shown in Fig. 2. In each case the vertical axis corresponds to ENN(W;Xtrain), and branches terminate at the values for particular local minima. At a regular series of thresholds for ENN we perform a superbasin analysis,18 which segregates the ML solutions into disjoint sets. Local minima that can interconvert via a pathway where the highest transition state lies below the threshold are in the same superbasin. The branches corresponding to different sets or individual minima merge at the threshold energy where they first belong to a common superbasin. In this case we have coloured the branches according to the misclassification index, discussed further in Section V, which is defined as the fraction of test set images that are misclassified by the minimum in question or the global minimum, but not both. All the low-lying minima exhibit small values, meaning that they perform much like the global minimum. The index rises to between 0.2 and 0.4 for local minima with higher values of ENN(W;Xtrain). These calculations were performed using the pele51 ML interface for the formulation in eqn (2), where regularisation excluded the weights for the bias nodes.23
When more hidden nodes are included the dimensionality of the ML landscape increases, along with the number of local minima and transition states. This observation is in line with well known results for molecular systems: as the number of atoms and configurational degrees of freedom increases the number of minima and transition states increases exponentially.73,74 The residual error reported by ENN(W;Xtrain) decreases as more parameters are included, and so there is a trade-off between the complexity of the landscape and the quality of the fit.23
The opportunities for exploiting tools from the potential energy landscape framework have yet to be investigated systematically. As an indication of the properties that might be of interest we now illustrate a thermodynamic analogue corresponding to the heat capacity, CV. Peaks in CV are particularly insightful, and in molecular systems they correspond to phase-like transitions. Around the transition temperature the occupation probability shifts between local minima with qualitatively different properties in terms of energy and entropy: larger peaks correspond to greater differences.75 For the ML landscape we would interpret such features in terms of fitting solutions with different properties. Understanding how and why the fits differ could be useful in combining solutions to produce better predictions. Here we simply illustrate some representative results, which suggest that ML landscapes may support an even wider range of behaviour than molecular systems.
To calculate the CV analogue we use the superposition approach2,73,76–79 where the total partition function, Z(T), is written as a sum over the partition functions Zα(T), for all the local minima, α. This formulation can be formally exact, but is usually applied in the harmonic approximation using normal mode analysis to represent the vibrational density of states. The normal mode frequencies are then related to the eigenvalues of the Hessian (second derivative) matrix, μα(i), for local minimum α:
(5) |
Many molecular and condensed matter systems exhibit a CV peak corresponding to a first order-like melting transition, where the occupation probability shifts from a relatively small number of low energy minima, to a much larger, entropically favoured, set of higher energy structures, which are often more disordered. However, low temperature peaks below the melting temperature, corresponding to solid–solid transitions, can also occur. These features are particularly interesting, because they suggest the presence of competing low energy morphologies, which may represent a challenge for structure prediction,80 and lead to broken ergodicity,55,79,81–86 and slow interconversion rates that constitute ‘rare events’.14,87,88 Initial surveys of the CV analogue for ML landscapes, calculated using analytical second derivatives of ENN(Wα;X), produced plots with multiple peaks, suggesting richer behaviour than for molecular systems.
One example is shown in Fig. 3, which is based on the ML solution landscape for a neural network with three hidden nodes and inputs corresponding to the three interparticle distances at the initial geometries of all the training optimisation sequences. These results were obtained using the pele51 ML interface for the neural network formulation in eqn (2), where regularisation did not include the weights for the bias nodes.23 The superposition approach provides a clear interpretation for the peaks, which we achieve by calculating CV from partial sums over the ML training minima, in order of increasing ENN(Wα;Xtrain). The first peak around kBT ≈ 0.2 arises from competition between the lowest two minima. The second peak around kBT ≈ 9 is reproduced when the lowest 124 minima are included, and the largest peak around kBT ≈ 20 appears when we sum up to minimum 153. The latter solution exhibits one particularly small Hessian eigenvalue, producing a relatively large configurational entropy contribution, which increases with T. The harmonic approximation will break down here, but nevertheless serves to highlight the qualitative difference in character of minima with exceptionally small curvatures.
In molecular systems competition between alternative low energy structures often accounts for CV peaks corresponding to solid–solid transitions, and analogues of this situation may well exist in the ML scenario. Some systematic shifts in the CV analogue could result from the density of local minima on the ML landscape (the landscape entropy89–92). Understanding such effects might help to guide the construction of improved predictive tools from combinations of fitting solutions. Interpreting the ML analogues of molecular structure and interconversion rates between minima might also prove to be insightful in future work.
A subsequent study investigated the quality of the predictions using two of the three interatomic distances, r12 and r13, and the effects of memory, in terms of input data from successive configurations chosen systematically from each geometry optimisation sequence.93 The same database of LBFGS minimisation sequences for the triatomic cluster was used here, divided randomly into training and testing sets of equal size (5000 sequences each). The quality of the classification prediction into the four possible outcomes can be quantified using the area under curve (AUC) for receiver operating characteristic (ROC) plots.69 ROC analysis began when radar receiver operators needed to distinguish signals corresponding to aircraft from false readings, including flocks of birds. The curves are plots of the true positive rate, Tpr, against the false positive rate, Fpr, as a function of the threshold probability, P, for making a certain classification. Here, P is the threshold at which the output probability pNN0(W;X) is judged sufficient to predict that a minimisation would converge to the equilateral triangle, so that
(6) |
(7) |
Fig. 4 shows the AUC values obtained with the LBFGS database when the input data consists of r12 and r13 values at different points in the minimisation sequence. Here the horizontal axis corresponds to s, the number of steps from convergence, and the AUC values therefore tend to unity when s is small, where the configurations are close to the final minimum. Each panel in the figure includes two plots, which generally coincide quite closely. The plot with the lower AUC value is the one obtained with the global minimum located for ENN(Wα;Xtrain) for configurations s steps for convergence. The AUC value in the other plot is always greater than or equal to the value for the global minimum, since it is the maximum AUC calculated for all the local minima characterised with the same neural net architecture and any s value. Minimisation sequences that converge in fewer than s steps are padded with the initial configuration for larger s values, which is intended to present a worst case scenario.
Fig. 4 shows that the prediction quality decays in a characteristic fashion as configurations move further from the convergence limit. It also illustrates an important result, namely that the performance of the global minimum obtained with the training data is never surpassed significantly by any of the other local minima, when all these fits are applied to the test data. Hence the global minimum is clearly a good place to start if we wish to make predictions, or perhaps construct classification schemes based on more than one local minimum of the ML landscape obtained in training. The corresponding fits and AUC calculations were all rerun to produce Fig. 4 using the fitting function defined in eqn (2) and regularisation over all variables, for comparison with the simplified bias weighting employed in ref. 93. There is no significant difference between our results for the two schemes.
y(x;q) = e−q1xsin(q2x + q3)sin(q4x + q5). | (8) |
We performed regression on the above problem, with a dataset D consisting of Ndata = 100 data points with xi values sampled uniformly in [0,3π], and corresponding ti values given by our model with added Gaussian white noise (mean zero, σ = 0.02):
ti = y(xi;q*) + noise, | (9) |
(10) |
We explored the landscape described by E(q), and display the various solutions in Fig. 5 alongside our data and y(x;q = q*). In this case, the global minimum is a fairly accurate representation of the solution used to generate D. However, 88 other solutions were found which do not accurately represent the data, despite being valid local minima. In Fig. 6 we show the disconnectivity graph for E(q). Here the vertical axis corresponds to E(q), and branches terminate at the values for corresponding local minima, as in Section III. The graph shows that the energy of the global minimum solution is separated from the others by an order of magnitude, and is clearly the best fit to our data (dotted curve in Fig. 5). The barriers in this representation can be interpreted in terms of transformations between different training solutions in parameter space, and could be indicative of the distinctiveness of the minima they connect. The minima of this landscape were found by the basin-hopping method,41–43 as described in Section II.
Fig. 5 Results from nonlinear regression for the model given by eqn (8): the global minimum of the cost function (dashed line) is plotted with various local minima solutions (solid lines) and the data used for fitting (black circles). The model used to generate the data is indistinguishable from the curve corresponding to the global minimum. |
Fig. 6 Disconnectivity graph for a nonlinear regression cost function [eqn (8)]. Each branch terminates at a local minimum at the value of E(q) for that minimum. By following the lines from one minimum to another, one can read off the energy barrier on the minimum energy path connecting them. |
The neural network defined above is quite small, and is not intended to compete with well-established models trained on MNIST.95 Rather, our goal in this Perspective is to gain insight into the landscape. This aim is greatly assisted by using a model that is smaller, yet still behaves similarly to more sophisticated implementations. To converge the disconnectivity graph, Fig. 7, in particular the transition states, the model was trained on Ndata = 1000 data points. The results assessing performance, Fig. 8 and 9, were tested on Ndata = 10000 images.
We explored the landscape of this network for several different values of the L2 regularisation parameter λ. The graph shown in Fig. 7, with λ = 0.01, is representative of all the others: we observe a single funnel landscape, where, in contrast to the nonlinear regression example, all of the minima occur in a narrow range of energy (cost) values. This observation is consistent with recent work suggesting that the energies of local minima for neural networks are spaced exponentially close to the global minimum96,97 with the number of variables (number of optimised parameters or dimensionality) of the system.
We next assess the performance profile of the NN minima by calculating the misclassification rate on an independent test set. Judging by average statistics the minima seem to perform very similarly: the fraction f of misclassified test set images is comparable for most of them, with 0.133 ≤ f ≤ 0.162 (mean = 0.148, standard deviation σf = 0.0043). This observation is also consistent with previous results where local minima were found to perform quite similarly in terms of test set error.98,99 When looking beyond average statistics, however, we uncover more interesting behaviour. To this end we introduce a misclassification distance between pairs of minima, which we define as the fraction of test set images that are misclassified by one minimum but not both.100 A value ij = 0 implies that all images are classified in the same way by the two minima; a value ij = maxij = fi + fj implies that every misclassified image by i was correctly classified by j, and every misclassified image by j was correctly classified by i. In Fig. 8 we display the matrix ij, which shows that the minima cluster into groups that are self-similar, and distinct from other groups. So, although all minima perform almost identically when considering the misclassification rate alone, their performance looks quite distinct when considering the actual sets of misclassified images. We hypothesise that this behaviour is due to a saturated information capacity for our model. This small neural network can only encode a certain amount of information during the training process. Since there are many training images, there is much more information to encode than it is possible to store. The clustering of minima in Fig. 8 then probably reflects the differing information content that each solution retains. Here it is important to remember that each of the minima were trained on the same set of images; the distinct minima arise solely from the different starting configurations prior to optimisation.
The misclassification similarity can be understood from the underlying ML landscape. We investigated correlations between misclassification distance and distance in parameter space. In Fig. 9 we display the joint distribution of the misclassification distance and Euclidean distance (L2 norm) between the parameter values for each pair of minima. We see that for a range of values these two measures are highly correlated, indicating that the misclassification distance between minima is determined by their proximity on the underlying landscape. Interestingly, for very large values of geometric distance there is a large (yet seemingly random) misclassification distance.
The seemingly random behaviour could possibly be the result of symmetry with respect to permutation of neural network parameters. There exist a large number of symmetry operations for the parameter space that leave the NN prediction unchanged, yet would certainly change the L2 distance with respect to another reference point. A more rigorous definition of distance (currently unexplored), would take such symmetries into account. There are at least two such symmetry operations.68 The first of these results from the antisymmetry of the tanh activation function: inverting the sign of all weights and bias leading into a node will lead to an inverted output from that node. The second symmetry is due to the arbitrary labelling of nodes: swapping the labels of nodes i and j within a given hidden layer will leave the output of a NN unchanged.
After constructing the network we can analyse properties such as average shortest path length, diameter, clustering coefficients, node degrees and their distribution. For an unweighted and undirected network, the shortest path between a pair of nodes is the path that passes through the fewest edges. The number of edges on the shortest path is then the shortest path length between the pair of nodes, and the average shortest path length is the average of the shortest path lengths over all the pairs of nodes. The diameter of a network is the path length of the longest shortest path. For the network of minima, the diameter of the network is the distance, in terms of edges, between the pair of minima that are farthest apart. The node degree is the number of directly connected neighbours.
For the non-linear regression model, we found 89 minima and 121 transition states. The resulting network consists of 89 nodes and 113 edges (Fig. 10). Although this is a small network we use it to introduce the analysis. The average node degree is 2.54 and the average shortest path length is 6.0641. The network diameter, i.e. the longest shortest path, is 15. Hence, on average a minimum is around 6 steps away from any other minimum, and the pair farthest apart are separated by 15 steps. Thus, a minimum found by a naive numerical minimisation procedure may be on average 6 steps, and in the worst case 15 steps, from the global minimum. Both the average shortest path and network diameter of this network are significantly larger than for a random network of an equivalent size.105
Fig. 10 Network of minima for the nonlinear regression cost function [eqn (8)]. Each node corresponds to a minimum. There is an edge between two minima if they are connected by at least one transition state. The size of the nodes is proportional to the number of neighbours. |
The network of minima for the neural network model in Section V has 142 nodes and 643 edges (Fig. 11). The nodes have on average 9.06 nearest neighbours, the average shortest path is 3.18, and the network diameter is 8. Hence, a randomly selected minimum is on an average only 3 steps away from any other minimum in terms of minimum-transition state-minimum triples. Therefore, on average, the global minimum is only 3 steps away from any other local minimum. The networks of minima defined by molecules such as Lennard-Jones clusters, Morse clusters, and the Thomson problem have also been shown to have small (of order O(log[numberofminima])) average shortest path lengths, meaning that any randomly picked local minimum is only a few steps away from the global minimum.101–103,106 Moreover, these networks exhibit small-world behaviour,107i.e. the average shortest path lengths of these networks are small and similar to equivalent size random networks, whereas their clustering coefficients are significantly larger than those of equivalent size random networks. We have conjectured that the small-world properties of networks of minima are closely related to the single-funnel nature of the corresponding landscapes.108–110 Some networks also exhibit scale-free properties,111 where the node-degrees follow a power-law distribution. In such networks only a few nodes termed hubs have most of the connections, while most other nodes have only a small number.
Fig. 11 Network of minima for the NN model cost function [eqn (8)] applied to digit recognition. The size of the nodes is proportional to the number of neighbours. |
The benefits of analysing network properties of machine learning landscapes may be two-fold. One can attempt to construct more efficient and tailor-made algorithms to find the global minimum of machine learning problems by exploiting these network properties, for example, by navigating the ‘shortest path’ to the global minimum. We also obtain a quantitative description of the distance between a typical minimum from the best fit solution.
In the future, we plan to study further properties of the networks of minima for a variety of artificial neural networks and test the small-world and scale-free properties. We hope that these results may be useful in constructing algorithms to find the global minimum of non-convex machine learning cost functions.
First, we define the notation for the rest of the section. We consider real valued polynomials, H(w), where w is the vector of variables of H; the degree of the polynomial H is p. We define the dimension of the polynomial by the length of the vector w, so if w = (w1,…,wN) then H is an N-dimensional polynomial. A degree p polynomial is homogeneous polynomial if it satisfies the following condition for any real t:
H(tw) = H((tw1,…,twN)) = tpH(w) | (11) |
Having defined the notation, we now clarify the connection between polynomials and spin glasses. Suppose the vector, w = (w1,…,wN), describes the states of N Ising spins that are +1 or −1. Then ∑wi2 = N, so that the distance to the origin is . The continuous analogue of this model is a hypercube embedded in a sphere of radius . Therefore, we can interpret the Hamiltonian of a spherical, p-body, spin glass by a homogeneous polynomial of degree p. This formulation for spin systems has been studied extensively in ref. 112–114. From here on, we will explicitly denote the dimension and the degree of the polynomial in a subscript.
The simplest case is when the degree is p = 1 and the polynomial (Hamiltonian) becomes
(12) |
When p = 2 the polynomial becomes
(13) |
The picture is rather different when we look at polynomials with degree p > 2. When p = 3 the polynomial becomes
(14) |
Fig. 12 Histogram for the energies of points found by gradient descent with the Hamiltonian defined in eqn (14), comparing low-dimensional and high-dimensional systems. −E0 denotes the ground state, and −E∞ denotes the large N limit of the level for the bulk of the local minima. |
To obtain a more precise picture for the energy levels of critical points, we will focus on the level sets of the polynomial and count the number of critical points in the interior of a given level set. Here, the polynomial is assumed to be non-degenerate (its Hessian has nonzero eigenvalues). Then, a critical point is defined by a point whose gradient is the zero vector, and a critical point of index k has exactly k negative Hessian eigenvalues. Finally, our description of the number of critical points will be in the exponential form and in the asymptotic, large N, limit.
Following Auffinger et al., let be the set of points in the domain of the function whose values lie below u. Also, let Ck(Au) denote the number of stationary (critical) points with index k in the set Au. Hence Ck(Au) counts the number of critical points with values below u. Then, the main theorem of ref. 112 produces the asymptotic expectation value for the number of stationary points below energy u:
(15) |
An inspection of the complexity function provides insight about the geometry at the bottom of the energy landscape (Fig. 13). The ground state is roughly at u = −1.657. For u ≥ −E∞ we do not see any local minima, because they all have values that are within the band −E0 ≤ u ≤ −E∞. Moreover, in the same band the stationary points of index k > 0 are outnumbered by local minima. In fact, this result holds hierarchically for all critical points of finite index (recall that the result is asymptotic so that by finite we mean fixing the index first and then taking the limit N → ∞). If we denote the x-axis intercept of the corresponding complexity function Θk as −Ek, with
Θk(−Ek) = 0 for k = 1, 2,… | (16) |
Fig. 13 Plots of the complexity function Θk defined in eqn (15) for local minima, and saddles of index 1, 2 and 50. In the band (−E0, −E50) there are only critical points with indices {1, 2,…, 49}. |
This scenario holds for the p-spin Hamiltonian with any p ≥ 3 where the threshold for the number of critical points is obtained asymptotically in the limit N → ∞. To demonstrate that it holds for reasonably small NFig. 14 shows the results for the p = 3 case with increasing dimensions. The concentration of local minima near the threshold increases rather quickly.
Fig. 14 Empirical variance of energies found by the gradient descent vs. the number of spins of the Hamiltonian defined in eqn (14). |
In Fig. 15 we show disconnectivity graphs for p = 3 spin spherical glass models with sizes N ∈ [50, 100] and fixed coefficients xijk ∼ (0,1). The landscape appears to become more frustrated already for small N, and the range over which local minima are found narrows with increasing system size, in agreement with the concentration phenomenon described in ref. 112.
Interestingly, the concentration of stationary points phenomenon does not seem to be limited to the p-spin system. Related numerical results99 show that the same effect is observed for a slightly modified version of the random polynomial, defined on the three-fold product of unit spheres. We do not yet have an analogue of the above theorem for Ĥ, so guidance from numerical results is particularly useful.
Let D := (x1,y1),…,(xn,yn) be n data points with input xi ∈ N, and label y ∈ ; and let G(w,x) = ŷ describe the output that approximates the label y parametrized by w ∈ M. Further, let (G(w,x),y) = (ŷ,y) be a non-negative loss function. Once we fix the dataset, we can focus on the averaged loss described by
(17) |
We emphasise that the similarity between the p-spin Hamiltonian and the machine learning loss function lies only in the concentration phenomena, perhaps because the two functions share some underlying structure. It is likely that this observation of concentration in the two systems, spin glasses and deep learning, are due to different reasons that are peculiar to their organisation. It is also possible that such concentration behaviour is universal, in the sense that it can be observed in various non-convex and high dimensional systems for which the two systems above are just examples. However, if we wish to describe the ML landscape in terms of glassy behaviour, we might seek justification through the non-linearities in neural networks (the hidden layer in Fig. 1). In some sense, the non-linearity of a neural network is the key that introduces competing forces in which neurons can be inhibitory or exhibitory. This behaviour may introduce a similar structure to the signs of interaction coefficients in spins, introducing frustration. We note that these interpretations are rather speculative at present. Another problem that requires further research is identification of all critical points, not only the ones with low index. A more systematic way to identify the notion of complexity is through finding all of the critical points of a given function. A further challenge lies in the degeneracy of the systems in deep learning. The random polynomials that we considered have non-degenerate Hessian eigenvalue spectra for stationary points at the bottom of the landscape. However, a typical neural network has most Hessian eigenvalues near zero.119
Recently, a complementary study of the minima and saddle points of the p-spin model has been initiated using the numerical polynomial homotopy continuation method,120,121 which guarantees to find all the minima, transition states and higher index saddles for moderate sized systems.116,117 An advantage of this approach is that one can also analyse the variances of the number of minima, transition states, etc., providing further insight into the landscape of the p-spin model. A study for larger values of p, analysing the interpretation in terms of a deep learning network, is in progress.118
The development of general methods for enumerating the number of accessible microstates in a system, and ideally their individual probabilities, is therefore of great general interest. As discussed in Section VII, for a few specific cases there exist methods – either analytical or numerical – capable of producing exact estimates of these numbers. However, these techniques are either not sufficiently general or simply not practical for problems of interest. To date, at least two general and practical computational approaches have been developed. ‘Basin-sampling’55 employs a model anharmonic density of states for local minima organised in bins of potential energy, and has been applied to atomic clusters, including benchmark systems with double funnel landscapes that pose challenging issues for sampling. The mean basin volume (MBV) method developed by Frenkel and co-workers57,123–125 is similar in spirit to basin-sampling, but is based on thermodynamic integration and, being completely general, requires no assumptions on the shape of the basins (although thus far all examples are limited to the enumeration of the minima). MBV has been applied in the context of soft sphere packings and has facilitated the direct computation of the entropy in two123,124 and three57 dimensions. Furthermore, the technique has allowed for a direct test of the Edwards conjecture in two dimensions,126 suggesting that only at unjamming – when the system changes from a fluid to a solid, which is the density of practical significance for many granular systems – the entropy is maximal and all packings are equally probable.
Despite the high computational cost, the MBV underlying principle is straightforward. Assuming that the total configuration volume of the energy landscape is known (simply = VN for an ideal gas of non-interacting atoms), if we can estimate the mean basin volume of all states, the number of minima is simply
(18) |
The computation of the basin volume is performed by thermodynamic integration. In essence, we perform a series of Markov chain Monte Carlo random walks within the basin applying different biases to the walkers and, from the distributions of displacements from a reference point (usually the minimum), compute the dimensionless free energy difference between a region of known volume and that of an unbiased walker. In other words
fbasin = fref + (basin − ref) | (19) |
Recently the approach has been validated when the dynamics determining the basin of attraction are stochastic in nature,129 which is precisely the situation encountered in the training by stochastic optimisation of neural networks and other non-convex machine learning problems. The extension of these techniques to machine learning is another exciting prospect, as it would provide a general protocol for quantifying the machine learning landscape and establishing, for instance, whether the different solutions to learning occur with different probabilities and, if so, what their distribution is. This characterisation of the learning problem could help to develop better models, as well as better training algorithms.
Energy landscape methods may provide a novel way of addressing one of the most intriguing questions in the machine learning research, namely why does machine learning work so well? One way to ask this question more quantitatively is to investigate why we can usually find a good candidate for the global minimum of a machine learning cost function relatively quickly, even in the presence of so many local minima. The present results suggest that the landscape for a range of models are single-funnel-like, i.e. the largest basin of attraction is that of the global minimum, and the downhill barriers that separate it from local minima are relatively small. This organisation facilitates rapid relaxation to the global minimum for global optimisation techniques, such as basin-hopping. Another possible explanation is that many local minima provide fits that are competitive with the global minimum.96,97,99 In fact, these two scenarios are compatible, so that global optimisation leads us downhill on the landscape, where we encounter local minima that provide predictions or classifications of useful accuracy.
The ambition to develop more fundamental connections between machine learning disciplines and computational chemical physics could be very productive. For example, there have recently been many physics-inspired contributions in machine learning, including thermodynamics-based models for rational decision-making,130 generative models from non-equilibrium simulations.131 The hope is that such connections can provide better intuition about the machine learning problems in question, and perhaps also the underlying physical theories used to understand them.
This journal is © the Owner Societies 2017 |