Thermodynamic origins of two-component multiphase condensates of proteins

Intracellular condensates are highly multi-component systems in which complex phase behaviour can ensue, including the formation of architectures comprising multiple immiscible condensed phases. Relying solely on physical intuition to manipulate such condensates is difficult because of the complexity of their composition, and systematically learning the underlying rules experimentally would be extremely costly. We address this challenge by developing a computational approach to design pairs of protein sequences that result in well-separated multilayered condensates and elucidate the molecular origins of these compartments. Our method couples a genetic algorithm to a residue-resolution coarse-grained protein model. We demonstrate that we can design protein partners to form multiphase condensates containing naturally occurring proteins, such as the low-complexity domain of hnRNPA1 and its mutants, and show how homo- and heterotypic interactions must differ between proteins to result in multiphasicity. We also show that in some cases the specific pattern of amino-acid residues plays an important role. Our findings have wide-ranging implications for understanding and controlling the organisation, functions and material properties of biomolecular condensates.


S1 DETERMINING THE CENTRE OF THE DENSITY PROFILE
To calculate the density profiles, the simulation box is divided into 150 bins along the elongated axis, and the average density of each species is calculated within each bin. The 'centre' region of the initial reference system is taken as the region between where the density profiles of the two protein species intersect. The 'vapour' region is then defined as the 50 bins in total where the first and last bin are equidistant from the middle of the 'centre' region. Once these regions are quantified for the initial reference system, we fix the centre of mass of the condensate in our simulations and keep these regions constant throughout the entire genetic algorithm run.   Figure S4. Manner in which the systematic changes are made in the coevolution runs with A1 LCD and its variants. We illustrate the sequence of the protein that is systematically changed to the A1 variant in each round by plotting the value of ,Mpipi for each residue along the sequence.   Figure S7. Changes in the average value of ,Mpipi . These are shown for the evolution runs where (a) the outer sequence is evolved and the inner sequence is kept unchanged and (b) the inner sequence is evolved and the outer sequence is kept unchanged, and for the coevolution runs where A1 LCD is designed to be (c) the inner sequence or (d) the outer sequence in the final multilayered system. The value of ,Mpipi plotted is averaged over all the residues of the evolved sequences in the entire populations in three independent runs for each case. Shaded areas correspond to the standard deviation.  Figure S9. Interaction-energy trends appear to be robust to initial sequence choice. The purple curve is reproduced from Fig. 4. The changes in interaction energies for an evolution run where the outer sequence is evolved and the inner sequence is kept unchanged starting from a different initial system [ Fig. S8], shown in blue, show similar trends to the curves in purple.
evolve I Figure S10. Change in the average value of ,Mpipi . This is shown for the evolution run in which the inner sequence is evolved and the outer sequence is kept unchanged [ Fig. 1 Figure S11. Change in the interaction energies during genetic-algorithm progression for coevolution runs. We show these changes for intermolecular self-interactions between proteins enriched in (a) the outer and (b) the inner layer, (c) cross-interactions between the two proteins and (d) the difference in the O-O and I-I interaction energies for the system with the fittest individual in the coevolution runs where one sequence is changed to A1 LCD and the other is coevolved. Shaded areas correspond to the standard deviation across three independent runs. Although the difference in O-O and I-I interactions decreases overall, this is because of the systematic changes made to the sequence. The trends for the genetic-algorithm rounds in between those at which systematic changes occur are largely consistent with the results shown in Fig. 4. Figure S12. Interfacial free-energy densities. (a) Interfacial free-energy densities as a function of temperature. Error bars correspond to the standard deviation of the interfacial free-energy density computed in several independent sections of the simulation. We use a linear fit to the data points to extract the interfacial entropy and energy for each sequence, as discussed in the main text. (b) Finite-size scaling of the interfacial free-energy densities. We have computed the interfacial free-energy density of the three sequences at 240 K with larger system sizes, varying both the bulk depth (red) and surface area (green). The interfacial free-energy density of the original system size, plotted in (a), is shown in blue. For all three sequences, there is no difference within error bars in the value of the interfacial free-energy density across the different system sizes tested.  Fig. 2(c)]. The remaining columns correspond to different shuffles of the same sequence, as indicated, using the same notation and colours as in the main text. Just above each density profile, we show a map of ,Mpipi values for the sequence in question, as in Fig. 5. In the top left-hand corner of each density plot, we give the fitness value relative to the evolved sequence of the left-most column. In the case of panels (a) and (b), even if amino acids are completely sorted by their interaction strength in the sequence, the multiphasicity is not appreciably reduced, suggesting that patterning is not crucial in these cases. By contrast, even just shuffling the spacers in panel (c) leads to a considerable reduction in fitness, whilst sorting the amino acids results in very different phase behaviour in panels (c) and (d), indicating that patterning is important in these cases. Figure S16. Density profile for sorted partner sequence. Density profile and simulation snapshot of the final evolved system with maximum fitness in the evolution run where the inner sequence is evolved, but with the originally fixed partner sequence O i = (FAFAA) 10 sorted such that residues of the same type are all clustered together. The density profile of the original evolved system is shown by the dotted lines. The number in the top left corner of the density plot is the fitness value relative to the original system with the unsorted partner sequence.

S9 COMPOSITION AND PATTERNING IN COEVOLUTION RUNS
As discussed in the main text, compositional demixing has been found to be favoured in two-component systems with a high 6 charge pattern mismatch. 1,2 Charge patterning can be characterised by the blockiness parameter 3 or by the sequence charge 7 decoration (SCD), defined by 4 where is the charge number of the residue at position along the chain and is the number of amino-acid residues in the protein 9 sequence. To investigate the effect of charge pattern mismatch on multiphasicity, we show in Fig. S17 the difference in SCD 10 between the two protein sequences for the systems in Fig. S15(c) and additional systems with residues of the same charge grouped 11 together. Since the phase behaviour of the systems considered here is not principally charge-driven, it is perhaps not surprising that 12 we find a low correlation between the charge pattern mismatch and the fitness. One thing to note about the patterning parameters SCD and , as well as the sticker pattering parameter aro 5 equivalent of , 14 is that studies investigating patterning based on these parameters usually look at sequences with the same overall composition, and 15 comparing the value of these patterning parameters is less meaningful for sequences of different compositions. 6 Consequently, we 16 have here compared the SCD values for shuffled sequences of the same system with the same overall composition. However, since 17 sequences change over the course of a genetic-algorithm run, it would be rather less straightforward to compare charge patterning.  Fig. S6(a)]. In each case, the intra-species pair correlation function is shown by solid lines in the same colour as the corresponding protein in the density profiles and simulation snapshots, while the inter-species pair correlation function is shown by the black dashed line. For the systems with high multiphasicity, the intra-species ( ) dominates at small over the inter-species ( ), whereas they are comparable for the well-mixed system. The pair correlation function ( ) is calculated as a histogram of the intermolecular distances, and the intermolecular separation is taken to be the shortest of the possible interbead separations, accounting for periodic boundary conditions.  The final evolved sequence with maximum fitness in the two 51 different evolution cases in terms of the approximate interaction 52 strengths of the residues are shown in Fig. S19(f) and Fig. S20(f).

53
The behaviour of these generic sequences across the genetic-  Figure S21. Finite-size scaling (I). Density profiles for the final system obtained in a coevolution run with A1 concentrated in the inner layer [cf. Fig. 2(b)] at two, four and eight times the size of the system reported in the main text, using the same colours and notation as in Fig. 2(b). (a) Density profiles, scaled along the horizontal axis for ease of comparison, are within typical noise across the four different system sizes. The original system size is shown in dotted lines, and systems at two, four and eight times the original size are shown in dashed-dot, dashed and solid lines, respectively. (b) Unscaled density profiles for the four system sizes. The area of the interface is kept constant at 10.9 nm × 10.9 nm, with the long axis increasing from 54.7 nm to 109.4 nm, 218.8 nm and 437.6 nm from top to bottom. The simulation snapshot included at the bottom is for the largest system size considered. For the largest system size considered, we have also shown the length in the direction in terms of the single-molecule g of the outer protein at infinite dilution to illustrate the thickness of the outer phase in terms of the dimensions of a single protein.
[The g of a protein chain of course depends on the environment that the protein is in, so the infinite-dilution g is just a rough approximation. The single-molecule radii of gyration for all the sequences mentioned in the main text are included in the SI alongside the sequence listing.] S14 b a 0 -108.7 108.7 z / nm z / Rg 0 50 50 Figure S22. Finite-size scaling (II). Density profiles for the final system obtained in a coevolution run with A1 concentrated in the inner layer [cf. Fig. S2c] at two, four and eight times the size of the system reported in the main text, using the same colours and notation as in Fig. S2c. (a) Density profiles, scaled along the horizontal axis for ease of comparison, are within typical noise across the four different system sizes. The original system size is shown in dotted lines, and systems at two, four and eight times the original size are shown in dashed-dot, dashed and solid lines, respectively. (b) Unscaled density profiles for the four system sizes. For the first three system sizes, the area of the interface is kept constant at 10.9 nm × 10.9 nm, with the long axis increasing from 54.7 nm to 109.4 nm and 218.8 nm from top to bottom. For the largest system size, to reduce the noise in the density of the inner phase (which needs to be determined more precisely for the computation of Fig. S23(c)), the area of the interface was increased to 15.5 nm × 15.5 nm with the long axis at 217.5 nm to maintain the same overall density in the simulation box. The simulation snapshot included at the bottom is for the largest system size considered. For the largest system size considered, we have also shown the length in the direction in terms of the single-molecule g of the outer protein at infinite dilution to illustrate the thickness of the outer phase in terms of the dimensions of a single protein.
[The g of a protein chain of course depends on the environment that the protein is in, so the infinite-dilution g is just a rough approximation. The single-molecule radii of gyration for all the sequences mentioned in the main text are included in the SI alongside the sequence listing.] , where the degree of multiphasicity is high, the coexisting phases are close to being pure phases. The dashed lines correspond to the density profiles of the pure components equilibrated at the same temperature as the complete multiphasic system shown in solid lines. In (c), where the degree of multiphasicity is lower, the inner phase contains both proteins in mixture (but in different proportions compared to the overall mixture), while the outer phase appears to be made up of mostly O pred , and there is a non-zero density of O pred in the vapour phase. We simulate (i) the inner phase and (ii) the outer phase individually in separate simulations, each in coexistence with the vapour phase. In (i) and (ii), the dashed lines correspond to the density profiles of the individual phases and the horizontal lines are drawn as a guide to indicate which densities we might expect to be the same. In all three cases, the pure-phase coexistence densities are consistent with the overall multiphasic systems.  Figure S24. Fitness function in finite-size scaling. We have computed density profiles for systems obtained in coevolution runs with A1 [cf. Fig. 2] at twice the size of the systems reported in the main text, i.e. with 90 molecules of A1 LCD, 90 chains of 100 residues each of the partner protein in a box of size 10.9 nm × 10.9 nm × 109.4 nm. The resulting density profiles, using the same colours and notation as in [Fig. 2], are within typical noise of the analogues for the original system size (shown with dashed lines, which are scaled by a factor of two along the horizontal direction to ease comparison). The behaviour observed is similar irrespective of how fit the systems are, and the fitness ordering is maintained when the system size is doubled, suggesting that finite-size effects are unlikely to dominate the phase behaviour. changes from the initial sequence in red. The sequences given here correspond to those with the maximum fitness only. In the supporting data, we provide listings of all sequences in the population of the final round of the genetic-algorithm runs.

66
For each sequence, we also report the radius of gyration for a single chain of the protein simulated at 250 K. The standard 67 deviation of the radius of gyration of each protein is no larger than 0.06 nm. 68 S14.0.1 Figure 1 69 The protein O i is (FAFAA) 5 . The protein I i is F 50 with added random mutations with a probability of 0.6.  We list below the sequences of the proteins shown in Fig. 2 of the main text. 72 We first list the sequence of A1 LCD and its variants, using the nomenclature of Bremer and co-workers. 7 The amino-acid residues 73 different from the wild type are highlighted in blue.  Evolution with partner protein fixed (Fig. 3a,b). Here, the protein O i is (FAFAA) 5 . The protein I i is F 50 with added random 78 mutations with a probability of 0.6. Coevolution with A1 LCD (Fig. 3c,d). In coevolution runs where A1 LCD ends up at the centre, we start with I = F 135 and