This Open Access Article is licensed under a Creative Commons Attribution-Non Commercial 3.0 Unported Licence

Margherita Mele^{a}, Roberto Covino^{b} and Raffaello Potestio*^{ac}
^{a}Physics Department, University of Trento, via Sommarive, 14 I-38123 Trento, Italy. E-mail: raffaello.potestio@unitn.it
^{b}Frankfurt Institute for Advanced Studies, 60438 Frankfurt am Main, Germany
^{c}INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, I-38123 Trento, Italy

Received 16th May 2022, Accepted 26th July 2022

First published on 7th September 2022

The steadily growing computational power employed to perform molecular dynamics simulations of biological macromolecules represents at the same time an immense opportunity and a formidable challenge. In fact, large amounts of data are produced, from which useful, synthetic, and intelligible information has to be extracted to make the crucial step from knowing to understanding. Here we tackled the problem of coarsening the conformational space sampled by proteins in the course of molecular dynamics simulations. We applied different schemes to cluster the frames of a dataset of protein simulations; we then employed an information-theoretical framework, based on the notion of resolution and relevance, to gauge how well the various clustering methods accomplish this simplification of the configurational space. Our approach allowed us to identify the level of resolution that optimally balances simplicity and informativeness; furthermore, we found that the most physically accurate clustering procedures are those that induce an ultrametric structure of the low-resolution space, consistently with the hypothesis that the protein conformational landscape has a self-similar organisation. The proposed strategy is general and its applicability extends beyond that of computational biophysics, making it a valuable tool to extract useful information from large datasets.

Among those systems that most clearly show this behaviour, a prominent example is represented by biological macromolecules such as proteins: these, being composed of several thousands of interacting atoms, display a rich and sophisticated phenomenology over a broad range of length and time scales, which cannot be naively predicted or anticipated from the knowledge of their structure. In order to generate, inspect, and comprehend the properties and behaviour of these systems, computational, in silico methods have been developed, most notably molecular dynamics^{1–4} (MD) simulations, that serve the purpose, among others, of sampling the conformational space of the molecule. Once a dataset of sampled conformations, or frames, is available, however, one faces the problem of extracting useful and intelligible information out of it, separating the relevant feature from the irrelevant detail.

This task can be carried out through dimensionality reduction^{5} or clustering schemes. These methods rely on some notion of similarity – usually a structural similarity – between distinct conformations to group together those whose differences are negligible, while a much larger discrepancy exists from other frames or groups of frames. It might appear desirable to devise these clustering schemes taking advantage of a preexisting knowledge about the system, in order to steer the algorithm towards physically sensible partitions of the sampled conformational space. It can be the case, however, that an undesired bias is introduced in the process, with potentially detrimental consequences for the interpretation of the results; alternatively, one might hope for a completely unsupervised procedure,^{6–8} so as to let the system itself dictate how to cluster its data points, and allow the intrinsic organisation of the conformational space to emerge.

A recently developed information-theoretical approach, the resolution-relevance framework,^{9,10} holds the promise to carry out this task of identifying intrinsically informative low-dimensional representations of the system in an unbiased manner. This approach relies on distinct measures of the information content of a dataset to group the instances of the latter in a way that optimally separates information from noise, and allows the extraction of the largest amount of information about the generative process that underlies the data points. The method, however, operates on the basis of a predefined clustering procedure, whose impact cannot be neglected in the assessment of the resulting partition's quality and physical soundness: in fact, the values of these information metrics for a given arrangement of the data points in clusters only make sense relative to the strategy employed to perform the grouping.

In this work, we tackle the issue of investigating if, and to what extent, different strategies to carry out the clustering of protein MD trajectory frames affect the intrinsic quality of the resulting partitions, and if the resolution-relevance framework can be employed to make sense of these results. We apply this strategy to a dataset of 12 structurally dissimilar proteins as well as to a specific case study, making use of agglomerative clustering strategies with 7 different linkage criteria. Our results support the hypothesis that the resolution-relevance analysis can select those linkage methods giving rise to low-resolution representations of the protein conformational space that reproduce the high-resolution reference with the highest degree of fidelity; furthermore, we propose that this capacity of performing a sensible clustering is a direct consequence of the clustering method being capable of preserving the intrinsically hierarchic structure and ultrametricity of the protein conformational space.

The output of a molecular dynamics simulation consists of a collection of M configurations, or frames, ŝ = (s^{(1)},…, s^{(M)}); these can be thought of as the realisations of a stochastic sampling process, where each element takes the values of one of the possible system states s = (s_{1},…, s_{n}), with n ≫ M. In spite of absolute structural differences, two distinct configurations might be equivalent for practical purposes; for example, if the relative position of a few atoms in two frames differs by less than a given tolerance, they might be considered equally representative of the same overall organisation of the molecule. In analysing the outcomes of a simulation it is thus crucial to filter out redundant details by grouping together structures that can be safely associated with the same state; hence, one has to perform a clustering.

The most trivial level of clustering consists in identifying each frame as a distinct cluster (assuming that no pair of exactly identical configurations exists in the sample). Such a representation clearly allows the highest level of detail in the description of the dataset, but it bears no use in making sense of it; the number of clusters thus has to be reduced, and frames that in principle describe distinct structural organisations have to be grouped together if their distance (as quantified by an appropriate measure) takes values below a predefined threshold. In so doing, the number of clusters is reduced from K = M to values K < M, which correspond to increasingly less resolved representations of the system's configuration space.

For each partition of the dataset it is possible to compute the corresponding values of the aforementioned resolution and relevance. Resolution is defined as (note that we employ logarithms in units of M, or Mats, so that log_{M}M = 1):

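H[s] = −∑_{s} (k_{s}/M) log_{M}(k_{s}/M) | (1) |

where k_{s} is the number of frames assigned to cluster s and the sum runs over the K clusters.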

Since all frames in a cluster are indistinguishable at the level of detail employed, the lowest resolution value H[s] = 0 is obtained when all frames are gathered in the same cluster; similarly, the largest value H[s] = log_{M}M = 1 is attained when each frame is a singleton cluster. Both extremes are equally uninformative: on the one hand, when the resolution is too low, potentially different conformations are grouped in the same cluster; on the other hand, discriminating all M states as distinct is equivalent to associating the same probability with each of them, which does not provide useful information to infer the underlying generative process. Hence, resolution alone is not sufficient to pinpoint an optimal level of detail at which the system should be inspected, and a second measure has to be employed to this end. Such a measure is the relevance H[k], given by:

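H[k] = −∑_{k} (k m_{k}/M) log_{M}(k m_{k}/M) | (2) |

where m_{k} is the number of clusters that contain exactly k frames.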

The relevance is null for both extreme values of the resolution: in the case H[s] = 0 all frames are in the cluster with k = M, which gives m_{M} = 1, m_{k≠M} = 0, and hence H[k] = 0; in the case H[s] = 1 all clusters only contain one frame, hence m_{1} = M, m_{k≠1} = 0 and H[k] = 0 as well.
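As an illustration, both quantities can be computed directly from a list of cluster labels. The following is a minimal sketch, assuming the standard entropy forms of the resolution-relevance framework with logarithms in base M; the function name `resolution_relevance` and the label encoding (one arbitrary hashable label per frame) are hypothetical choices made for this example.

```python
from collections import Counter
from math import log


def resolution_relevance(labels):
    """Return (H[s], H[k]) in base-M units for a list of per-frame cluster labels."""
    M = len(labels)
    if M < 2:
        raise ValueError("at least two frames are needed for log base M")
    sizes = Counter(labels)          # k_s: number of frames in cluster s
    m = Counter(sizes.values())      # m_k: number of clusters with exactly k frames
    h_s = -sum((k / M) * log(k / M, M) for k in sizes.values())
    h_k = -sum((k * mk / M) * log(k * mk / M, M) for k, mk in m.items())
    return h_s, h_k


# Extremes: one big cluster, and one cluster per frame.
print(resolution_relevance([0, 0, 0, 0]))
print(resolution_relevance([0, 1, 2, 3]))
# Intermediate partition with clusters of sizes 3, 2 and 1.
print(resolution_relevance([0, 0, 0, 1, 1, 2]))
```

In this toy case two equally sized clusters, e.g. labels `[0, 0, 1, 1]`, also give zero relevance: the frequency k = 2 does not distinguish between the two clusters, so a non-zero relevance requires clusters with different populations.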

As the relevance is nonnegative and equal to zero at the extremes of the resolution range, it follows that the relevance as a function of the resolution has to have a maximum; thence, there must be one representation, with intermediate resolution and positive relevance value, that more than the others allows an informative characterisation of the underlying probability distribution.^{14} The partitions at the right of the maximum are in what is called the under-sampling regime, M ≪ n, in which the statistics of the data is relatively poor and several frames associated to distinct states can happen to appear the same number of times. For a given value of the resolution in this region, those partitions that maximise the relevance – the most informative samples – feature a frequency distribution that follows a power law, m_{k} ∼ k^{−μ−1} with μ > 0, such that each value of the frequency is associated to a distinct number of clusters. In particular, the partition for which the quantity H[s] + H[k] is the largest has μ = 1: this corresponds to Zipf's law, m_{k} ∼ k^{−2}, which is associated to the point of optimal tradeoff between parsimony of the representation (low resolution) and its informativeness (high relevance).^{15,16} This is the case, for example, for the frequency of the words in a language,^{17} and the spike patterns of neuron populations^{18} (even though the occurrence of Zipf's law in the latter case, e.g. in neural data from the retina, might be a consequence of the statistics of input – the underlying visual scene – rather than of the neuronal dynamics itself^{19–21}).

In a context of complete ignorance, i.e. in absence of any information about the data except their empirical probability k/M based on some pre-defined classification, the frequency is the only label that can be employed to distinguish between frames in distinct states.^{15,16} The frequency thus constitutes a minimally sufficient representation, which, in absence of additional information about the data, allows one to write the resolution H[s] as:

H[s] = H[k] + H[s|k] | (3) |

This is a crucial aspect, which shows that the implication “high relevance, therefore informative representation” is not necessarily true. The concept of informative representation, obtained as maximization of the relevance, is independent of what the sample represents. Indeed, a random clustering of the system may produce partitions with high relevance values that are informative about some generative process but devoid of any significant information on the specific model producing the data under study (Fig. 1). Consequently, complementing the relevance-resolution framework with a sensible strategy to group elements into clusters based on the physical properties (geometric, structural, energetic, etc.) of the sample is crucial to steer the generation of empirical probabilities that have maximum relevance consistent with the imposed boundaries.

Fig. 1 Relevance-resolution curves obtained partitioning the simulation data of protein 1DSL with two clustering protocols, average linkage and single linkage. The two panels differ by the atom selection adopted: all atoms on the left and C_{β} atoms on the right. In both panels the random curve is also present, obtained by randomly partitioning the structures into groups (see Methods). Each (H[s], H[k]) point corresponds to a fixed number of clusters in which the frames of the whole trajectory are grouped. The theoretical highest H[k] for given H[s] is also plotted (black dotted line); specifically, the dotted lines show the upper and lower bounds to the theoretical maximum.^{12,13} All curves show the expected characteristic trend: zero relevance at the lowest (all frames in one cluster) and highest (every frame in its own cluster) resolution values.

In the case of agglomerative clustering of molecular structures, the clustering procedure relies on the specific functions defining the inter-frame and inter-cluster distance (Fig. 1). The former defines the property in terms of which the similarity of two configurations is quantified (structure, compactness, energy, etc.), while the latter is the metric employed to measure the distance between clusters. The latter, which is referred to as linkage criterion, thus determines the protocol employed for agglomerative clustering, and different choices result in different partitions of the system. The ability of a given protocol to return a meaningful partitioning naturally depends on the specific dataset under examination. For example, in single linkage the similarity of two clusters is equivalent to that of their most similar members; this protocol is effective in identifying compact and separate clusters, but it is strongly subject to the chaining effect: two close-by points can form a bridge between two clusters, causing them to merge and resulting in an elongated cluster. Without any prior characterisation of the explored configurational space, the goodness of the partition can be assessed only a posteriori.
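The chaining effect can be illustrated with a toy example. The sketch below is not the protocol used in this work: it is a naive, pure-Python agglomerative clustering of one-dimensional points, in which the function name `agglomerate`, the absolute difference used as inter-point distance, and the data are all assumptions made for illustration.

```python
def agglomerate(points, n_clusters, linkage="single"):
    """Naive agglomerative clustering of 1D points; returns sorted clusters."""
    # Single linkage uses the smallest pairwise distance, complete the largest.
    reduce_fn = {"single": min, "complete": max}[linkage]
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# A chain of points with slowly growing gaps, followed by a more distant point.
data = [0, 10, 21, 33, 46, 60, 80]
print(agglomerate(data, 2, "single"))    # [[0, 10, 21, 33, 46, 60], [80]]
print(agglomerate(data, 2, "complete"))  # [[0, 10, 21, 33], [46, 60, 80]]
```

With single linkage each point acts as a bridge to the next one, so the whole chain ends up in one elongated cluster; complete linkage, which penalises large intra-cluster spread, splits the same data into two more compact groups.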

In this work, we employed various linkage criteria and investigated the most informative partitions obtained with each of them, with two objectives: first, to identify those linkage methods that are most appropriate for a meaningful and physically sound clustering of molecular structure data; second, to infer general properties of the configurational space explored by proteins in molecular dynamics simulations. In the next section we report the results of our study; the list and definitions of the employed distance measures and linkage criteria are provided in the Appendix.

The data in Fig. 1 suggest that it is possible to characterise the ability of a clustering method to identify informative partitions across various levels of structural detail. Indeed, it can be seen that, although all the curves in the figure were obtained by clustering the same trajectory frames, the values of the multi-scale relevance (MSR), i.e. the area under each curve, are rather different. In particular, the selection of atoms drastically influences the MSR value obtained through a given clustering protocol. The results of this analysis, carried out for 7 linkage criteria and 12 proteins at varying levels of structural resolution (all, C_{α}, and C_{β} atoms), are summarised in Fig. 2, where the deviation of the MSR relative to the random reference (see eqn (4)) is plotted against the mean value of the RMSD matrices used for the clustering procedure. Fig. 2 additionally allows comparison of the obtained MSR values with the optimal one, namely the area under the theoretical curve maximizing relevance at each level of resolution (the colored region marks the upper and lower limits of the theoretical maximum^{12,13}). For some linkage criteria, the values of this deviation are always positive (i.e. the MSR is larger than the random value MSR_{R}) and close to the optimal one regardless of the system and structural selection employed; for other methods, the performance depends on the system or its representation. In particular, the plot shows that the performance of some methods correlates with the mobility of the protein as quantified by the average RMSD. The linkage criteria can be divided in two groups: those for which the MSR value correlates with the mean value of the RMSD matrix (‘centroid’ and ‘median’) and those for which it does not (‘average’, ‘complete’, ‘weighted’ and ‘ward’). Consequently, it can be argued that the former are less effective and reliable than the latter: in fact, their ability to identify simplified but meaningful representations strongly depends on the specific data under examination, and their applicability is restricted to high-resolution (all-atom) descriptions. This represents a substantial shortcoming, in that these methods are not adequate for the analysis of all-atom data in less detailed terms.
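Since the MSR is compared above with the area under the theoretical curve, it can be estimated numerically from a set of (H[s], H[k]) points. The following is a minimal sketch, assuming a trapezoidal integration over the sampled points; the function name `multiscale_relevance` and the example curves are hypothetical, and the normalisation of the deviation from the random reference used in eqn (4) is not reproduced here.

```python
def multiscale_relevance(curve):
    """Trapezoidal area under a list of (H[s], H[k]) points."""
    pts = sorted(curve)  # order the points by increasing resolution
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Made-up curves: both vanish at H[s] = 0 and H[s] = 1, as expected.
method_curve = [(0.0, 0.0), (0.5, 0.3), (1.0, 0.0)]
random_curve = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.0)]

msr = multiscale_relevance(method_curve)
msr_random = multiscale_relevance(random_curve)
print(msr, msr_random, msr > msr_random)
```

The accuracy of such an estimate depends on how densely the resolution axis is sampled, that is, on how many cluster numbers K are considered.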

The single linkage criterion shows a peculiar pattern, and deserves to be discussed separately. Here, the distance between clusters is the minimum pairwise distance between their elements. Hence, in the all-atom case, as the matrix elements are widely spread, this algorithm manages to form differently-populated clusters. Conversely, the C_{α} and C_{β} selections implement a coarse-graining that “blurs” the structural differences from the outset; therefore, the algorithm tends to form highly populated clusters by putting together even frames that are relatively different from each other (chaining effect), and provides a rather uninformative representation of the system.

It can be shown^{25} that some of the hierarchical clustering algorithms induce a monotonic hierarchy, i.e. the values in the inter-cluster distance matrix increase monotonically during agglomerative clustering. Algorithms that induce a monotonic hierarchy lead to an ultrametric in the cluster space:^{26} this implies that the metric distance satisfies an inequality stronger than the triangular one.^{27,28} In our analysis, it turns out that clustering protocols that satisfy these qualities coincide with those showing a consistently positive deviation of the MSR from the random reference; the only exception to this trend is single linkage, which, although inducing an ultrametric in cluster space, still shows negative values of this deviation when coarse-grained representations of the system are employed. In this case, however, the clustering protocol is severely limited by the chaining effect, plausibly producing uninformative partitions of the system and consequently obtaining a lower or comparable MSR value with respect to the random case. Taken together, these results suggest that protein structures sampled in the course of a molecular dynamics simulation populate the configurational space according to an ultrametric structure, which is consistent with the self-similar organisation of the free energy landscape observed in previous works;^{29–33} additionally, the MSR appears to be capable of capturing, in a parameter-free and unbiased manner, the effectiveness of a clustering method in finding informative representations of a biomolecule's configurational variability at different scales of resolution, in that MSR correlates with the method's capacity to preserve the ultrametric structure of the reference configurational space.
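The inequality stronger than the triangular one is the ultrametric inequality, d(x,z) ≤ max{d(x,y), d(y,z)}. As a minimal sketch, it can be checked by brute force on a small symmetric distance matrix; the function name `is_ultrametric`, the tolerance, and the two example matrices are assumptions made for illustration.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Check d[i][k] <= max(d[i][j], d[j][k]) for all triplets of distinct indices."""
    n = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k]) + tol
               for i, j, k in permutations(range(n), 3))


# Merge heights of a small dendrogram: a and b join at height 1, c joins at 3.
tree_like = [[0, 1, 3],
             [1, 0, 3],
             [3, 3, 0]]
# Ordinary distances between the points 0, 1 and 2 on a line.
on_a_line = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]

print(is_ultrametric(tree_like))  # True
print(is_ultrametric(on_a_line))  # False: 2 > max(1, 1)
```

In an ultrametric space every triangle is isosceles with the two longest sides equal, which is the geometric counterpart of a hierarchy whose merge heights increase monotonically.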

Since relevance and resolution are not sensitive to the features of the elements gathered in the clusters and their relative similarity, it is crucial to validate a posteriori that partitions with a higher relevance are indeed more informative than the others. This task can be achieved through dimensionality reduction techniques. In particular, we use diffusion maps,^{34–36} which project the high-dimensional trajectory of the molecule in Cartesian coordinate space onto a low-dimensional manifold of collective coordinates called diffusion coordinates (DCs). We thus performed a comparison of the distribution of points (frames or cluster centroids) in the space spanned by the first two DCs obtained from the high-resolution (HR) or low-resolution (LR) representation of the system (see Methods). It is reasonable to expect that a meaningful partition gathers, in the same cluster, frames close in the HR space, and that the distribution of centroids resembles the HR distribution, thus allowing the same information to be extracted. In order to assess and compare the goodness of partitions we resort to the decomposition of the covariance matrix in its inter- and intra-state contributions (see eqn (5)–(7) in the Methods section). In fact, a key property of an informative LR representation of a system is to capture more information in the retained data than what is left in the discarded ones; we thus expect that the trace of the inter-cluster covariance will be significantly higher than the intra-cluster one.
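The comparison between inter- and intra-cluster contributions can be sketched as follows. The code is an illustrative assumption rather than the exact form of eqn (5)–(7): it applies the standard law of total covariance with population (1/M) normalisation and returns only the traces, and the function name `covariance_traces` and the test points are hypothetical.

```python
def covariance_traces(points, labels):
    """Return (trace_total, trace_inter, trace_intra) for labelled points."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(dim)]
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = sum((p[a] - mean[a]) ** 2 for p in points for a in range(dim)) / n
    inter = intra = 0.0
    for members in groups.values():
        # Cluster centroid and its population weight.
        c = [sum(p[a] for p in members) / len(members) for a in range(dim)]
        inter += len(members) / n * sum((c[a] - mean[a]) ** 2 for a in range(dim))
        intra += sum((p[a] - c[a]) ** 2 for p in members for a in range(dim)) / n
    return total, inter, intra


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(covariance_traces(pts, [0, 0, 1, 1]))  # (25.25, 25.0, 0.25)
print(covariance_traces(pts, [0, 1, 0, 1]))  # (25.25, 0.25, 25.0)
```

The first labelling groups nearby points and retains almost all of the variance in the inter-cluster term; the second mixes distant points, so most of the variance is lost inside the clusters.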

We thus proceeded to investigate in greater detail the relationship between linkage method and informativeness of the resulting LR representation of a protein's conformational space. To this end, we focused on a specific case study, that of adenylate kinase: the configurations obtained from an 800 ns long simulation, reduced to the positions of the sole C_{β} atoms, were grouped with the single, average, and random clustering methods at various levels of cluster numbers, corresponding to 22 different resolution values in the range [0,1]; the spacing in resolution is ∼0.06 between the first 7 representations and ∼0.04 between the remaining ones. For each LR partition we computed, and reported in Fig. 3, the resolution-relevance plots (left panel), the trace of the inter-state and intra-state matrix (middle panel), and the value of the Pearson correlation coefficient (PCC) between the first two DCs in the HR and LR space (right panel); the last two sets of quantities are plotted against the number of clusters K employed in the representation.

It is possible to observe that, at a fixed level of resolution, the LR representations obtained through the average linkage are simpler and more informative, as the corresponding number of clusters K is lower and the relevance value higher than those obtained by single linkage; the latter also produces a relevance curve that lies very close to that of the random partition, while the average linkage curve closely approaches the lower bound to the maximum. This hierarchy in performance is also confirmed by the trends of the trace of the covariance matrices and PCC: already with a small number of clusters (K ∼ 10) average linkage identifies LR representations in which the inter-cluster contribution is significantly higher than the intra-cluster one. In contrast, at the same number of clusters the single linkage algorithm produces partitions for which the two terms are of the same order of magnitude, or even ranked oppositely (the intra-state contribution is larger than the inter-state).

These results show that the LR representations obtained with single linkage clustering do not fully capture the information contained in the data and destroy a comparable, or even larger, amount of information than what is maintained. In general, the inter-state (resp. intra-state) contribution to the covariance for average linkage is always significantly higher (resp. lower) than that obtained with single linkage, and the outcomes of the two linkage methods are comparable only when more than half of the frames are retained. Results from dimensionality reduction also support the observation that the average linkage identifies more informative LR representations than those produced by single linkage in that, coherently with the trend of MSR, the PCC value is consistently higher in the former case than in the latter.

In the three panels of Fig. 3 the yellow stars indicate the representations that, for each method, maximise the relevance. Interestingly, in both the graph of covariance matrices and of the PCC, these representations are at the elbow of the curve. Further analyses in support of this interpretation, specifically of the location of the point of slope μ = 1 and of the inter-cluster distance as a function of the cluster number, are provided as ESI.†

The observed behaviour is thus suggestive of the fact that further increases in resolution lead to an increased model complexity that is no longer balanced by information gain: the tradeoff between complexity and informativeness turns in favour of the former, consistently with the interpretation of relevance as a measure of useful information content. The resolution-relevance framework allows for the identification of optimally chosen representative configurations of the dataset; any observable can be computed on these configurations, provided that their statistical weight, proportional to the size of the corresponding cluster, has been accounted for. This results in a significant saving in computational cost: in fact, assuming that the cost of computing the observable in question scales linearly with the size of the dataset, the saving is proportional to the ratio between the number of data points and the number of clusters; in the case of the representation identified by average linkage, this ratio is ∼50. We note, in passing, that the aforementioned strategy is insensitive to any temporal ordering of the data points, if any; hence, in order to perform any measure that relies on such order (e.g. diffusion and transport coefficients) one has to explicitly include this information in the clustering procedure, specifically associating a time stamp to the representative frames and keeping track of the identity of particles.
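The reweighting of representative configurations can be sketched in a few lines. This is a minimal illustration, assuming that the observable has already been evaluated on one representative frame per cluster; the function name `weighted_average` and the numerical values are hypothetical.

```python
def weighted_average(values, cluster_sizes):
    """Average of per-representative values, weighted by cluster population."""
    total = sum(cluster_sizes)
    return sum(v * k for v, k in zip(values, cluster_sizes)) / total


# Two representatives standing for clusters of 3 and 1 frames respectively.
estimate = weighted_average([1.0, 3.0], [3, 1])
print(estimate)  # 1.5

# Cost ratio under the linear-scaling assumption: frames over clusters.
n_frames, n_clusters = 4, 2
print(n_frames / n_clusters)  # 2.0
```

The estimate coincides with the full-trajectory average only insofar as the observable varies little within each cluster, which is why the quality of the partition matters.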

To gain further insight into the statistical significance of these results, we compared the data obtained for average and single linkage with those of the random clustering. The relevance curve of the latter is very close to that of single linkage, and the MSR values associated with single linkage (MSR_{S} = 0.235) and random clustering (MSR_{R} = 0.233) differ only at the third decimal place; in spite of that, the usefulness of the partitions obtained with an information-driven protocol is incomparably greater than that returned by random clustering: the trace of the inter-cluster covariance matrix of the latter is always lower than the intra-cluster one until we consider representations in which about 2/3 of the original frames are preserved, and the PCC between reference and random partition DC is almost zero at any level of resolution. These observations further support the idea that the relevance alone cannot be taken as an absolute measure of the informativeness of a given low-resolution representation; however, this quantity in combination with the appropriate classification method proves extremely effective in identifying protocols that maximise the emergence of useful information.

We then looked in detail at the three representations that maximise the relevance for each of the clustering methods under examination. In the right-hand side of Fig. 4 the distributions of centroids in LR representations are compared with the frame distributions in the HR ones, as the points in each panel are coloured according to the value taken from the first DC in the LR representation. A visual inspection of these data shows that the distribution of average linkage centroids in the LR DC space is consistent with that of the HR frames; in both graphs it is possible to recognise a colour gradient along the x axis, showing that neighbouring frames in the HR space are grouped together in the LR space. As for the single linkage criterion, the LR representation maximising the relevance produces a slightly different distribution of points than that of the HR frames; furthermore, looking at the colour of points in both spaces it appears that distant frames in the diffusion space are associated with the same cluster. This is even more evident when correlating the values assumed by the DCs in the HR and LR representation, as shown in the bottom-left corner of Fig. 4. For both linkage measures (average and single) it is possible to identify a strong correlation between the first DCs in LR and HR: the Pearson correlation coefficient is 0.95 for average linkage and 0.85 for single linkage; nevertheless, in the case of single linkage, some clusters contain frames with a wide distribution of HR diffusion coordinate values, i.e., frames carrying very different information are mistakenly lumped in the same bin. Last but not least, we observe, as expected, a total lack of correlation – both in terms of point distribution and cluster composition – between the DCs of the random partition and the reference HR DCs.

Finally, a detailed analysis of the relevance-resolution curve, in the upper left corner of Fig. 4, shows that the relevance obtained from average linkage is significantly higher than that obtained from single linkage; additionally, it is comparable with its theoretical maximum. Indeed, the average linkage curve lies close to the area enclosed by the upper and lower limits to the theoretical maximum of the relevance.^{12,13} Since relevance proved capable of capturing the informativeness of representations, one might think of modifying the clustering outcome by shifting points between neighboring groups to further increase the relevance value. In this respect, the results obtained through average linkage would represent an ideal starting point; indeed, being already so close to the theoretical maximum, they would allow the relevance to be further increased through perturbative changes that, while violating the rules of the clustering, would preserve its general structure.

To this end, the resolution-relevance framework represents a novel, powerful instrument to construct informative simplified representations of a molecule's conformational space; however, a blind and black-box application of this approach bears the risk of giving high-relevance partitions more credit than they deserve, in that the quality of said partitions cannot be disentangled from the specific classification method employed to construct them.

In the present work we have tackled this issue through the systematic, dataset-wide application of the resolution-relevance framework to a number of structurally distinct proteins, making use of state-of-the-art agglomerative clustering methods. Our results show that the clustering strategies, and more specifically the particular definitions of inter-cluster distance, employed to group together “similar” frames into structurally homogeneous clusters return different values of the multi-scale relevance, a global measure of the relevance at various levels of resolution. We find that the clustering methods having higher values of the MSR are those that produce the most physically sensible partitions, as quantified in terms of intra- and inter-cluster covariance, as well as the correlation between the collective diffusion coordinates computed in the reference, high-resolution space and those of the low-resolution representation. Most interestingly, a positive correlation emerges between high values of MSR and the efficacy of a clustering method in reconstructing a low-resolution representation that features an ultrametric structure: this observation is suggestive of the fact that the configurational space spanned by a protein in the course of a molecular dynamics simulation is intrinsically organised in a hierarchical manner, which is consistent with the hypothesis, proposed and verified in the literature, that the free energy landscape of proteins is effectively self-similar.

In conclusion, we propose that the clustering method employed in the dimensionality reduction of a dataset could be not only employed as a tool to preprocess the data in order to analyse them, but also treated as an analysis tool itself: in fact, through the joint usage with the general, parameter-free resolution-relevance framework it is possible to discriminate among partitioning approaches that produce low-resolution models more or less representative of the salient qualities of the high-resolution reference. The combination of these algorithms can thus pave the way to an even more fruitful deployment of clustering approaches in computational biophysics, bringing further insight in the behaviour of complex macromolecules.

Two specific proteins were used for the second part of the analysis. The first is adenylate kinase (PDB code 4AKE), chosen because of its relatively small size and the possibility of observing conformational transitions over time scales easily achievable by means of plain MD. The second system is the humanised IgG4 monoclonal antibody (PDB code 5DK3), chosen because of its large size and higher structural and dynamical complexity.^{41} As the results obtained in the two cases are consistent, for the sake of clarity we only report the data pertaining to adenylate kinase in the main text, while those of the antibody are provided as ESI.†

(4) |

$$C_{ij} = C^{\mathrm{intra}}_{ij} + C^{\mathrm{inter}}_{ij} \tag{5}$$

(6) |

(7) |
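The splitting of the covariance matrix in eqn (5) can be illustrated numerically. The following is a minimal sketch, assuming the standard population-weighted definitions of the two terms (the intra-cluster part as the weighted average of the covariances computed within each cluster, the inter-cluster part as the weighted covariance of the cluster means about the global mean); the function name `covariance_decomposition` and the synthetic data are illustrative assumptions and are not taken from the analysis reported here.

```python
import numpy as np

def covariance_decomposition(X, labels):
    """Split the covariance of X (frames x coordinates) into intra- and inter-cluster parts.

    Assumed definitions (population-weighted, not necessarily those of the paper):
      C_intra = sum_k w_k * Cov(frames in cluster k)
      C_inter = sum_k w_k * (mu_k - mu)(mu_k - mu)^T,  with w_k = n_k / N
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_frames = X.shape[0]
    global_mean = X.mean(axis=0)
    centred = X - global_mean
    C_total = centred.T @ centred / n_frames  # population covariance

    C_intra = np.zeros_like(C_total)
    C_inter = np.zeros_like(C_total)
    for k in np.unique(labels):
        members = X[labels == k]
        weight = members.shape[0] / n_frames
        mu_k = members.mean(axis=0)
        dev = members - mu_k
        C_intra += weight * (dev.T @ dev / members.shape[0])
        shift = (mu_k - global_mean)[:, None]
        C_inter += weight * (shift @ shift.T)
    return C_total, C_intra, C_inter

# Synthetic "frames": two clusters of different size and spread in a 3D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 3)),
               rng.normal(4.0, 0.5, size=(40, 3))])
labels = np.array([0] * 60 + [1] * 40)

C, C_intra, C_inter = covariance_decomposition(X, labels)
print("C == C_intra + C_inter:", np.allclose(C, C_intra + C_inter))
```

With these definitions the identity holds exactly for any partition of the frames; what changes from one clustering method to another is how the total covariance is distributed between the two terms.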

$$D(C_{i},C_{j}) = \min\{d(x_{i},x_{j}) : x_{i} \in C_{i},\ x_{j} \in C_{j}\} \tag{8}$$

Single linkage clustering tends to produce elongated clusters because of the so-called chaining effect: two points that form a bridge between two clusters are enough for single linkage to merge them into one.

• Complete linkage: it considers the distance between two clusters to be equal to the largest distance from any member of one cluster to any member of the other cluster.

$$D(C_{i},C_{j}) = \max\{d(x_{i},x_{j}) : x_{i} \in C_{i},\ x_{j} \in C_{j}\} \tag{9}$$

This procedure tends to form smaller and more compact clusters.

• Average linkage: it considers the distance between two clusters as the average distance between all pairs of points coming from the different groups.

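$$D(C_{i},C_{j}) = \frac{1}{|C_{i}|\,|C_{j}|}\sum_{x_{i} \in C_{i}}\sum_{x_{j} \in C_{j}} d(x_{i},x_{j}) \tag{10}$$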

• Weighted linkage: also in this case, the protocol takes as cluster distance the average distance from any member of one cluster to any member of the other one. The difference is that the distance between the new cluster and another is weighted with respect to the number of data in each cluster. Consequently, the distance between the cluster C_{k} = C_{i}∪C_{j} and a third cluster C_{l}, not involved in the definition of C_{k}, is:

(11) |

• Centroid linkage: in this case, two clusters are merged based on the distance between their centroids. The centroid of a cluster is defined as:

$$\mu_{i} = \frac{1}{|C_{i}|}\sum_{x \in C_{i}} x \tag{12}$$

Consequently, the distance between two clusters is the Euclidean distance between their centroids:

$$D(C_{i},C_{j}) = \lVert \mu_{i} - \mu_{j} \rVert_{2} \tag{13}$$

The centroid of the resulting cluster C_{k} = C_{i}∪C_{j} is recomputed according to eqn (12) considering all the points belonging to it.

• Median linkage: the procedure is similar to centroid linkage, except that the centroid μ_{k} of the resulting cluster is the average of the centroids of the two merged clusters:

$$\mu_{k} = \frac{\mu_{i} + \mu_{j}}{2} \tag{14}$$

This is equivalent to giving the same weight to merged clusters regardless of the number of elements in them.

• Ward linkage: the method aims to minimise the increase of the intra-cluster sum of squared errors:

(15) |

(16) |

Consequently, the distance between the new cluster C_{k} = C_{i}∪C_{j} and an unused cluster C_{l} is given by the recursive equation:

(17) |
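The seven linkage criteria listed above can be compared on a toy dataset. The sketch below is illustrative only: it assumes SciPy's `scipy.cluster.hierarchy.linkage` as the clustering routine and uses synthetic two-dimensional points in place of MD frames, neither of which is claimed to reproduce the protocol followed in this work. It also checks whether the cophenetic (dendrogram-induced) distances satisfy the strong triangle inequality that defines an ultrametric; the helper `is_ultrametric` is a hypothetical name introduced for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for MD frames: three well-separated blobs in a 2D feature space.
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

methods = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]
partitions, cophenetic = {}, {}
for method in methods:
    # centroid, median and ward require Euclidean distances in SciPy
    Z = linkage(frames, method=method, metric="euclidean")
    partitions[method] = fcluster(Z, t=3, criterion="maxclust")  # cut into <= 3 clusters
    cophenetic[method] = squareform(cophenet(Z))  # dendrogram-induced distances

def is_ultrametric(D, tol=1e-9):
    """Check D[i, j] <= max(D[i, k], D[k, j]) for every triplet (i, j, k)."""
    for k in range(D.shape[0]):
        bound = np.maximum(D[:, k][:, None], D[k, :][None, :])
        if np.any(D > bound + tol):
            return False
    return True

raw = squareform(pdist(frames))
print("raw Euclidean distances ultrametric:", is_ultrametric(raw))
for method in ("single", "complete", "average", "weighted", "ward"):
    print(method, "cophenetic distances ultrametric:", is_ultrametric(cophenetic[method]))
```

Centroid and median linkage are left out of the final loop because their dendrograms can contain inversions, in which case the cophenetic distances need not be ultrametric.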

All definitions of distance between clusters can be summarised by the recursive relation proposed by Lance and Williams:^{53}

$$D(C_{l},C_{i} \cup C_{j}) = \alpha_{i} D(C_{l},C_{i}) + \alpha_{j} D(C_{l},C_{j}) + \beta D(C_{i},C_{j}) + \gamma \left| D(C_{l},C_{i}) - D(C_{l},C_{j}) \right| \tag{18}$$
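As an illustration of the Lance and Williams recurrence, the sketch below evaluates the distance between a cluster C_{l} and a merged cluster C_{i}∪C_{j} both directly from the points and through the recursive update, using the coefficient values commonly quoted in the clustering literature. The function names and the synthetic clusters are assumptions made for the example. For Ward linkage the recurrence is applied here to the merge cost n_{a}n_{b}/(n_{a} + n_{b})‖μ_{a} − μ_{b}‖², a squared-distance quantity, rather than to a distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lance_williams(d_li, d_lj, d_ij, n_i, n_j, n_l, method):
    """Distance between C_l and C_i U C_j from the three pre-merge distances."""
    n_ij = n_i + n_j
    coefficients = {  # (alpha_i, alpha_j, beta, gamma), textbook values
        "single": (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, 0.5),
        "average": (n_i / n_ij, n_j / n_ij, 0.0, 0.0),
        "weighted": (0.5, 0.5, 0.0, 0.0),
        "ward": ((n_i + n_l) / (n_ij + n_l), (n_j + n_l) / (n_ij + n_l),
                 -n_l / (n_ij + n_l), 0.0),
    }
    a_i, a_j, beta, gamma = coefficients[method]
    return a_i * d_li + a_j * d_lj + beta * d_ij + gamma * abs(d_li - d_lj)

def direct(A, B, method):
    """Inter-cluster dissimilarity computed from the points themselves."""
    pairwise = cdist(A, B)
    if method == "single":
        return pairwise.min()
    if method == "complete":
        return pairwise.max()
    if method == "average":
        return pairwise.mean()
    if method == "ward":  # merge cost (squared-distance convention)
        n_a, n_b = len(A), len(B)
        return n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
    raise ValueError(f"no direct definition implemented for {method!r}")

# Three synthetic clusters of different sizes.
rng = np.random.default_rng(2)
C_i = rng.normal(0.0, 1.0, size=(5, 3))
C_j = rng.normal(2.0, 1.0, size=(8, 3))
C_l = rng.normal(-3.0, 1.0, size=(4, 3))
C_k = np.vstack([C_i, C_j])  # merged cluster C_i U C_j

checks = {}
for method in ("single", "complete", "average", "ward"):
    recursive = lance_williams(direct(C_l, C_i, method), direct(C_l, C_j, method),
                               direct(C_i, C_j, method),
                               len(C_i), len(C_j), len(C_l), method)
    checks[method] = bool(np.isclose(recursive, direct(C_l, C_k, method)))
    print(method, "recurrence matches direct computation:", checks[method])
```

Weighted linkage is listed among the coefficients but not checked against a direct computation, since its inter-cluster distance depends on the merge history rather than on the points alone.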

- M. Karplus and G. A. Petsko, Nature, 1990, 347, 631–639.
- M. González, École thématique de la Société Française de la Neutronique, 2011, 12, 169–200.
- A. C. Pan, T. M. Weinreich, S. Piana and D. E. Shaw, J. Chem. Theory Comput., 2016, 12, 1360–1367.
- S. A. Adcock and J. A. McCammon, Chem. Rev., 2006, 106, 1589–1615.
- G. A. Tribello and P. Gasparotto, Front. Mol. Biosci., 2019, 6, 46.
- A. Glielmo, B. E. Husic, A. Rodriguez, C. Clementi, F. Noé and A. Laio, Chem. Rev., 2021, 121, 9722–9758.
- F. Noé and C. Clementi, Curr. Opin. Struct. Biol., 2017, 43, 141–147.
- A. Glielmo, C. Zeni, B. Cheng, G. Csanyi and A. Laio, arXiv preprint arXiv:2104.15079, 2021, 8.
- C. Battistin, B. Dunn and Y. Roudi, Curr. Opin. Syst. Biol., 2017, 1, 122–128.
- M. Marsili and Y. Roudi, Phys. Rep., 2022, 963, 1–43.
- S. Grigolon, S. Franz and M. Marsili, Mol. BioSyst., 2016, 12, 2147–2158.
- M. Marsili, I. Mastromatteo and Y. Roudi, J. Stat. Mech.: Theory Exp., 2013, 2013, P09003.
- A. Haimovici and M. Marsili, J. Stat. Mech.: Theory Exp., 2015, 2015, P10013.
- J. Song, M. Marsili and J. Jo, J. Stat. Mech.: Theory Exp., 2018, 2018, 123406.
- R. J. Cubero, M. Marsili and Y. Roudi, Entropy, 2018, 20, 755.
- R. J. Cubero, J. Jo, M. Marsili, Y. Roudi and J. Song, J. Stat. Mech.: Theory Exp., 2019, 2019, 063402.
- G. K. Zipf, Selected studies of the principle of relative frequency in language, Harvard university press, 2013.
- G. Tkačik, T. Mora, O. Marre, D. Amodei, S. E. Palmer, M. J. Berry and W. Bialek, Proc. Natl. Acad. Sci. U. S. A., 2015, 112, 11508–11513.
- J. Tyrcha, Y. Roudi, M. Marsili and J. Hertz, J. Stat. Mech.: Theory Exp., 2013, 2013, P03005.
- D. J. Schwab, I. Nemenman and P. Mehta, Phys. Rev. Lett., 2014, 113, 068102.
- L. Aitchison, N. Corradi and P. E. Latham, PLoS Comput. Biol., 2016, 12, e1005110.
- M. I. Ionescu, Proteins, 2019, 38, 120–133.
- E. Formoso, V. Limongelli and M. Parrinello, Sci. Rep., 2015, 5, 1–8.
- R. J. Cubero, M. Marsili and Y. Roudi, J. Comput. Neurosci., 2020, 48, 85–102.
- G. W. Milligan, Psychometrika, 1979, 44, 343–346.
- S. C. Johnson, Psychometrika, 1967, 32, 241–254.
- N. Jardine and R. Sibson, Math. Biosci., 1968, 2, 465–482.
- H. Fushing, H. Wang, K. van der Waals, B. McCowan and P. Koehl, PLoS One, 2013, 8, e56259.
- F. Pontiggia, G. Colombo, C. Micheletti and H. Orland, Phys. Rev. Lett., 2007, 98, 048102.
- A. Volkhardt and H. Grubmüller, Phys. Rev. E, 2022, 105, 044404.
- M. J. Pandya, S. Schiffers, A. M. Hounslow, N. J. Baxter and M. P. Williamson, Front. Mol. Biosci., 2018, 5, 115.
- K. A. Henzler-Wildman, M. Lei, V. Thai, S. J. Kerns, M. Karplus and D. Kern, Nature, 2007, 450, 913–916.
- K. Henzler-Wildman and D. Kern, Nature, 2007, 450, 964–972.
- J. De la Porte, B. Herbst, W. Hereman and S. Van Der Walt, Nineteenth Annual Symposium of the Pattern Recognition Association of South Africa, 2008, 15–25.
- S. Lafon and A. B. Lee, IEEE Trans. Pattern Anal. Mach. Intell., 2006, 28, 1393–1403.
- B. Nadler, S. Lafon, R. R. Coifman and I. G. Kevrekidis, Appl. Comput. Harmon. Anal., 2006, 21, 113–127.
- T. Tarenzi, G. Mattiotti, M. Rigoli and R. Potestio, Appl. Sci., 2022, 12, 7157.
- C. Micheletti, P. Carloni and A. Maritan, Proteins: Struct., Funct., Bioinf., 2004, 55, 635–645.
- R. Potestio, T. Aleksiev, F. Pontiggia, S. Cozzini and C. Micheletti, Nucleic Acids Res., 2010, 38, W41–W45.
- A. Amadei, M. A. Ceruso and A. Di Nola, Proteins: Struct., Funct., Bioinf., 1999, 36, 419–424.
- T. Tarenzi, M. Rigoli and R. Potestio, Sci. Rep., 2021, 11, 1–12.
- H. Bekker, H. Berendsen, E. Dijkstra, S. Achterop, R. Vondrumen, D. Vanderspoel, A. Sijbers, H. Keegstra and M. Renardus, 4th International Conference on Computational Physics (PC 92), 1993, pp. 252–256.
- M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess and E. Lindahl, SoftwareX, 2015, 1, 19–25.
- K. Lindorff-Larsen, S. Piana, K. Palmo, P. Maragakis, J. L. Klepeis, R. O. Dror and D. E. Shaw, Proteins: Struct., Funct., Bioinf., 2010, 78, 1950–1958.
- R. W. Hockney and J. W. Eastwood, Computer simulation using particles, CRC Press, 2021.
- G. Bussi, D. Donadio and M. Parrinello, J. Chem. Phys., 2007, 126, 014101.
- M. Parrinello and A. Rahman, J. Appl. Phys., 1981, 52, 7182–7190.
- B. Hess, H. Bekker, H. Berendsen and J. Fraaije, LINCS: A linear constraint solver for molecular simulations, J. Comput. Chem., 1997, 18, 1463–1472.
- R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane and V. S. Pande, Biophys. J., 2015, 109, 1528–1532.
- J. K. Leman, B. D. Weitzner, S. M. Lewis, J. Adolf-Bryfogle, N. Alam, R. F. Alford, M. Aprahamian, D. Baker, K. A. Barlow and P. Barth, et al., Nat. Methods, 2020, 17, 665–680.
- F. Pontiggia, A. Zen and C. Micheletti, Biophys. J., 2008, 95, 5901–5912.
- A. Kitao, S. Hayward and N. Go, Proteins: Struct., Funct., Bioinf., 1998, 33, 496–517.
- G. N. Lance and W. T. Williams, Comput. J., 1967, 9, 373–380.

## Footnote

† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2sm00636g

This journal is © The Royal Society of Chemistry 2022