Phase and interface determination in computer simulations of liquid mixtures with high partial miscibility

Partially miscible solutions can represent a challenge from the computer simulation standpoint, especially if the mutual solubility of the components is so large that their concentrations do not change much from one phase to another. In this case, identifying which molecules belong to which phase becomes a complicated task. Here, we propose a density-based clustering approach with self-tuning capabilities and apply it to the case of the mixture of an ionic liquid with benzene. The almost linear scaling of the algorithm makes it suitable for the analysis of long Molecular Dynamics or Monte Carlo trajectories.


Introduction
The physics and chemistry of fluids and soft materials are strongly influenced by the presence of interfaces separating different phases or micro-phases, 1-3 and many effects that are relevant for determining the macroscopic behaviour of these systems are occurring right at these interfaces, often within the first few molecular layers. 4 Being able to determine and selectively study the molecules at interfaces is therefore very important in order to improve our understanding of multiphase systems. Computer simulations have been instrumental in changing our picture of fluid interfaces, showing that, for example, far enough from critical points, their local structure is not at all characterized by the slowly varying, sigmoidal-shaped density profile predicted by mean field theories. 5 Rather, the structure of liquid interfaces presents features that closely resemble the molecular pair correlation function. Thermal capillary waves, in fact, smear this sharply structured, intrinsic profile into a smooth function that bridges the two bulk phases across the interface. There are approaches, such as that of Berkowitz, 6 that estimate the average distribution of surface atoms in the reference frame of the simulation box. However, one needs to have access to the set of surface atoms in every frame 7 in order to compute intrinsic quantities.
Determining the interface between two phases in an atomistic computer simulation, however, often requires a priori knowledge of which molecules belong to which phase. Far enough from critical points (be that the liquid-vapour critical point for single component systems, or the critical mixing point in bicomponent systems), the problem of distinguishing the two phases is less severe: simulating water/vapour coexistence at a temperature of 300 K in a simulation box with an edge of about 10 nm, one can hardly find a single molecule in the vapour phase. If the vapour phase is composed of few molecules, a possible way to separate the two phases consists of defining clusters of molecules that are within a given cutoff distance from each other. If the cutoff distance corresponds to the location of the first minimum of the liquid-phase radial distribution function, the liquid phase can be reasonably defined as the largest cluster in the system. 8 This approach fails if the density of the two phases becomes too similar, as this clustering approach will identify a single, large, connected cluster spanning the complete simulation box. Another, more robust approach consists of considering the liquid phase as the largest cluster of molecules, which have at least a given number of neighbours. 7,9 Also this criterion has its drawbacks, as we will show, as it does not consider points with smaller local density, which are still reachable, within the cutoff distance, from the denser regions.
The equivalent situation for bicomponent systems is that of partial miscibility, where the two components A and B can give rise to an A-rich phase F A and a B-rich phase F B , where the respective concentrations are c A (F A ) 4 c A (F B ) and c B (F A ) o c B (F B ). The solubility of A in B does not need to be the same as that of B in A, which can lead to extreme situations in which one phase is a single component one (say, c A (F B ) = 0), while in the other phase the two components have similar concentrations, This condition is exemplified by the case of the ionic liquid 1-butyl-3-methylimidazolium bis(trifluoromethanesulfonyl)imide (BMIM NTf 2 ) in combination with the simplest aromatic solvent, benzene. The concentration of BMIM NTf 2 in the benzene-rich phase is negligible, while in the ionic liquid-rich phase, benzene is present in an extremely high concentration that reaches a molar ratio c benzene /c total of 0.82. 10 Previously, we investigated the structure of fluid interfaces between ionic liquids and other fluids with little or moderate mutual miscibility. 11 In particular, we were interested in how the intrinsic interface structures of two of the most popular ionic liquid compounds, BMIM NTf 2 and 1-butyl-3methylimidazolium hexafluorophosphate (BMIM PF 6 ), change upon varying the polarity of the opposite fluid phase. The interfacial structure of the mixture of BMIM NTf 2 and benzene, however, cannot be studied using the approaches used so far, because of the peculiar mixing properties just described. This is evident in Fig. 1, where we report a simulation snapshot of the mixture of BMIM NTf 2 and benzene at room temperature. Benzene molecules are represented using a contour joining the six carbon atoms, while atoms of BMIM NTf 2 are represented with a space-filling model using blue spheres (Fig. 1, left panel). The ionic liquid is clearly separating into two phases, one of which does not show the presence of any BMIM + or NTf 2 À ions. Benzene, on the other hand, is present throughout the simulation box, as one can appreciate by removing the ionic liquid molecules (Fig. 1, right panel), forming a percolating system that cannot easily be separated into two phases using the simple clustering criterion mentioned before.
However, the peculiar properties of this mixture make it a perfect candidate for testing new algorithms that work closer to critical points. This is because the well separated BMIM NTf 2 phase can serve as a clear reference for the other component. One can thus test any algorithm that aims to separate the high benzene concentration phase from the low benzene concentration one by checking that the former is complementary to the ionic liquid rich region.

A density-based clustering approach
One of the most widespread clustering algorithms for class identification in spatial databases is the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. 12 According to the basic idea of DBSCAN, the so-called core points of a set are those with, within a cut-off r c , at least N c neighbours or, equivalently, a given local threshold density r c = 3/(4pr c 3 )N c ; directly density-reachable points are then those found within r c from core points. Two points are said to be density-connected if both can be reached, starting from the same core point, through a chain of directly density-reachable intermediate points. A cluster is finally defined as the set of all points, which are density-connected to one core point. It is clear that the condition N c = 1 yields the simple clustering algorithm mentioned in the introduction, which is therefore a special case of DBSCAN. A clustering algorithm which is limited to core points, therefore not including densityreachable ones, has been used by Chacón and Tarazona to separate small vapour clusters from the liquid phase. 7,9 From the point of view of computational complexity, DBSCAN can be performed in a time O(N log N), 12 where N is the number of points, because the core points and the directly densityreachable points associated with them can be identified using, for example, a kd-tree based search that takes (on average) O(log N) iterations. 13 In Fig. 2 we show the result of DBSCAN filtering applied to a set of points on the x-y plane, whose coordinates are sampled from two uniform random distributions (upper panel), a more dense one in the interval x from 0 to 1, and a less dense one (1/3 relative density) for x from 1 to 2. Along the y axis, all coordinates are sampled from a uniform random distribution from 0 to 1. Periodic boundary conditions are applied along the y axis. For this example, r c = 0.18 and N c = 24. The DBSCAN filtering (Fig. 2, lower panel) results in the identification of the largest cluster composed of core points (red points) and density-reachable points (black circles), which almost coincide with the set of points sampled at higher density (x o 1). Note that although the algorithm does fairly well in recovering the two initial distributions, it is not expected (and in our case, not meant) to do so: at the interface, x C 1, the juxtaposition of the two distributions creates regions of high local density for x 4 1, which are reachable from the main cluster. Conversely, regions of low local density for x o 1 can occur just because of the random nature of the distribution and, if they occur close to x = 1, they can become part of some of the smaller clusters. The importance of considering not only core points (that is, those with a local density higher than the chosen threshold) but also the density-reachable ones can be appreciated by noticing, in Fig. 2, the number of non-core points lying well within the high-density region.
We tested the performances of the algorithm by applying it to similar distributions in the three-dimensional space, by varying the number of points N points from 500 to 5 Â 10 6 . The time it takes to perform the filtering, shown in Fig. 3 as blue circles, is virtually indistinguishable from a linear function of N points (dashed line). The prefactors involved in the calculation are also relatively low, as indicated by the timing of about 1 s for 1 million points.

Automating the threshold determination
The almost linear scaling of DBSCAN makes it suitable for processing large datasets such as molecular dynamics trajectories. However, differently from the simple neighbour clustering algorithm, two parameters must be provided (r c and r c ), rather than simply the cutoff radius r c . The choice of this pair of parameters is crucial, as they determine whether the algorithm will be able to distinguish correctly the two phases or not. The identification of proper values of r c and r c should be the outcome of a simple procedure, if not of an automated one, in order for the algorithm to be of practical use.
A common approach in the simple neighbour clustering algorithm is to use as a cutoff radius the distance at which the first minimum of the radial distribution function (of the dense phase) is located. In this way, two molecules are considered to be in the same cluster if they are located at a distance, which would make them (by definition) belong to each other's shell of first neighbours. In the case of the BMIM NTf 2 benzene mixture, benzene separates in two regions, where its concentration changes from about 7 to 3 molecules per nm 3 (see Fig. 4). The intermolecular radial distribution function for the carbons of benzene rings in the high and low benzene concentration regions are shown in Fig. 5. In the high concentration region, a clear, deep minimum is present at around 0.8 nm, but a less pronounced one appears not far from the close contact distance, at about 0.55 nm. Note that the first peak of the radial distribution function in the high benzene concentration region is lower than the corresponding one in the low benzene concentration region: this could seem counterintuitive at first, but one should not forget that the functions are normalized to their respective bulk concentrations, which have a ratio of 3 : 7. Therefore, the local concentration is always smaller, in absolute terms, in the low benzene concentration region than in the high one.
To understand which approach could be best suited for choosing the cutoff density r c , we calculated the distribution of the number of neighbouring molecules for different   cutoff radii, r c , calculated over the whole trajectory and over all benzene molecules in the simulation box. The results are presented in Fig. 6. When using a small cutoff radius, r c = 0.45 nm, the distribution of the number of neighbours N n shows a shoulder at about N n = 5, hinting at the presence of an underlying multimodal distribution, which cannot be properly resolved. By increasing the cutoff to r c = 0.6 and 0.75 nm, the distribution stretches to a higher number of neighbours, as the available volume increases, and, at the same time, the presence of two peaks is gradually revealed. The two peaks, corresponding to the density distributions in the two concentration regions, are best resolved when a large cutoff is being used. We have chosen to characterize the distribution using a k-means clustering algorithm 14 with two modes, which provides two centroids (represented by two circles in Fig. 6 for r c = 0.75 nm). From the average of the positions of the two centroids (denoted by a diamond symbol in Fig. 6) one can obtain a possible estimate of the threshold density r c , as below/above the corresponding number of neighbours most of the distribution is contributed to by atoms in the low/high benzene concentration region, respectively.
Due to the non-miscibility of the ionic liquid in benzene, no ionic liquid should be found in the benzene-rich phase determined by the algorithm. This sets a fairly restrictive criterion to be met by the algorithm. However, by using the midpoint of the two centroids, about 5% of the total number of ionic liquid molecules (51 ions out of about 1024 in the system) can be found within the high benzene concentration region, showing that this choice of r c is not adequate. The reason for this behaviour stems from the fact that by placing the cutoff at the midpoint of the two centroids, some benzene molecules in the right tail of the distribution centred at the lower concentration are identified as belonging to the high concentration region. By using a cutoff concentration corresponding to the upper centroid, instead of the midpoint, the proportion of ions found in the high benzene concentration region drops to about 0.5%. As there are 512 ion pairs in the system, this means that on average about one ion pair (per side) is found across the benzene interface. The choice of the upper centroid concentration as a threshold one does, therefore, improve noticeably the quality of the phase identification. This can be appreciated by inspecting the intrinsic density profile, 9 calculated by constructing the histogram of the distance of molecules from the local position of the interface x(x, y), as as presented in Fig. 7. The virtual absence of ions inside the benzene rich phase shows that this automatic choice for the threshold density r c is in fact able to distinguish the two phases in a consistent way. The intrinsic density profile shows another interesting feature of the mixture, namely, the presence of a depletion layer of benzene next to the interface, and a corresponding increase in the concentration of the ionic liquid, which cannot be noticed in the non-intrinsic density profile.   Another important requirement for the algorithm is its stability against small (or large) variation of r c , the only free parameter left (as r c can be determined automatically). Fig. 8 reports, for the same configuration, but for different values of r c , the molecules belonging to the largest cluster (grey) and to the other, smaller clusters (blue). In addition, we have run an interfacial analysis using the Identification of Truly Interfacial Molecules algorithm (ITIM), 15 to inspect also the features of the interface between the high and low benzene concentration regions. The atoms belonging to the interface (but still part of the high benzene concentration region) are represented in red. Visual inspection shows that the DBSCAN-based algorithm succeeds in identifying consistently the two regions for all cutoff values, even when the distributions of neighbours are strongly overlapping (see Fig. 6). For cutoff values lower than 0.5 nm, however, the algorithm seems to identify a larger number of benzene molecules, with some protrusion into the ionic liquidrich phase noticeable in the left interface.
To quantify the difference between the phases obtained with different cutoff values, as well as the differences between the corresponding interfaces, we calculated the number of molecules belonging to the largest cluster (i.e., the benzene-rich phase) and the interface, as a function of r c , and presented this information in Fig. 9. Two main trends can be seen, namely, a sharp decrease in the number of surface molecules and an increase in the number of molecules in the largest cluster when the cutoff increases from 0.35 to about 0.6 nm. After this point, the number of surface molecules and the size of the biggest cluster change only slightly, and the dependence on the cutoff value suggests the presence of a minimum value for the number of surface molecules. Intuitively, one would consider as an optimal choice for r c the one that minimizes the surface area 7 or, in this case, the number of surface molecules, which is roughly proportional to it. In principle, one could point out that from r c = 1 nm up to the largest value r c = 1.45 nm, the number of surface molecules differs, on average, only by about 5 surface molecules (roughly 1% of the total). These small variations are probably not consequential for many practical applications, and any cutoff radius above 1 nm would be an equivalently good choice. One should bear in mind, however, that the larger the cutoff radius, the more computationally intensive the DBSCAN filtering will be.
The solubility of benzene, defined as the ratio of the benzene molar concentration and total molar concentration in the ionic-liquid rich phase, can be accessed either by counting the number of benzene molecules not in the largest cluster, or through a fit in the bulk region of the density profile. The two values (0.70 AE 0.01 and 0.69 AE 0.01, respectively) compare reasonably well with the experimental solubility of benzene in BMIM NTf 2 , 0.82 measured under the same thermodynamic conditions. 10 It is worth mentioning that the algorithm is not restricted to planar phase separations, but can be applied without any modification in the presence of corrugated interfaces, or when one of the phases is dispersed into the other one, such as droplets of oil dispersed in a water/oil mixture, as long as the cutoff radius r c is smaller than that of the droplets.
One might wonder how other similar algorithms compare to the present one in determining the phases of a partially miscible system, or, conversely whether the present approach is appropriate also in the fully demixed case. To test similar algorithms we investigated the BMIM NTf 2 /benzene system using the ITIM algorithm with simple clustering, and the Generalized ITIM (GITIM) algorithm. 16 Simulation snapshots resulting from these tests are shown in the ESI. † In particular, it clearly emerged that the use of a simple clustering scheme does not allow us to recognize properly the two phases, to the extent that it is impossible for the ITIM algorithm to distinguish upper and Fig. 8 Benzene molecules in one simulation snapshot assigned to different phases (gray: benzene-rich phase; red: interfacial molecules of the benzene-rich phase; blue: ionic liquid-rich phase) using different values for the cutoff of the DBSCAN algorithm. From top to bottom: cutoff of 0.4, 0.6, 0.8, and 1.2 nm. Notice that the system is translated so that the centre of mass of the largest cluster is located in the middle of the box. lower sides of the interface. With the simple clustering scheme, all molecules in the simulation box are assigned to the highdensity phase, and those tagged as interfacial are not representative at all of the real surface molecules. The GITIM algorithm performs better, as it is able to reveal the small empty pockets, which are abundant in the low benzene concentration phase. The large majority of benzene molecules in the latter phase, however, are identified as surface molecules, and some additional kind of filtering would be required to properly separate the two phases.
The opposite limit of a well-defined phase separation with vanishing vapour pressure is also instructive. We have considered a water/vapour interface formed by 1000 water molecules at 300 K, simulated using the SPC/E water model, 17 and assigned molecules to the two phases using both the simple clustering scheme and the present method. At a temperature of 300 K, in most frames not a single water molecule is clearly seen in the vapour phase, and in these cases the simple clustering approach assigns all water molecules to the liquid phase. The distribution of the number of neighbours, however, is in this case no longer bimodal, which was a prerequisite for using the automatic determination of the threshold density r c . Setting by hand r c to, for example, one water molecule every 35 Å 3 (therefore slightly lower than the bulk number density), the density based clustering algorithm also assigns all molecules to the liquid phase. If, instead, the upper centroid is used to set r c , the following is observed: for a cutoff value of 0.8 nm, some molecules in the outer part of the first molecular layers are identified as vapour ones; by increasing the cutoff value the number of molecules recognized as vapour ones decreases, and for cutoff values larger than 1.5 nm, they completely disappear (simulation snapshots of the water/vapour interface are also shown in the ESI †).
We can therefore conclude that in the extreme case when no or very few molecules are present in the low-concentration phase, the automatic threshold determination should be avoided, in favour of either choosing manually the threshold density, or using the simple clustering method. Either way, the same result is obtained.

Methods
Molecular dynamics simulations of the BMIM NTf 2 /benzene system were carried out with the GROMACS 4.6.7 program package on the isobaric-isothermal ensemble under ambient conditions (at 298.15 K and 1 bar). 18 During simulations, the cross-sectional area of the rectangular simulation box was kept fixed (5.7 nm Â 5.7 nm) and only the direction perpendicular to the plane of the interface was allowed to change (fluctuating around the value of 18.51 nm) according to the prescribed pressure. To simulate the isobaric-isothermal ensemble we applied the Nosé-Hoover thermostat 19,20 in combination with the Parrinello-Rahman barostat, with relaxation constants of 1.0 ps and 2.0 ps, respectively. 21 To model BMIM NTf 2 , the force field published by Logotheti et al. 22 was applied, while benzene was described by the general OPLS force field, including all hydrogen atoms. 23 The Lennard-Jones interaction parameters between the ionic liquid and benzene were determined using Lorentz-Berthelot mixing rules. All bond lengths were kept fixed using the LINCS algorithm 24 to make sure we can safely use a time step of 2 fs for the integration of the equations of motion. Non-bonded interactions were truncated beyond 1.2 nm, while the electrostatic interaction was calculated using the PME algorithm 25 in its smooth variant. 26 Analytical tail corrections for the energy and pressure of the dispersion interactions were also applied. Periodic boundary conditions were applied in all directions. The simulated system contained 512 ion pairs as well as 2500 benzene molecules. Starting from totally unmixed phases, an equilibration was performed for 600 ns. This long equilibration time was needed because of the peculiar mixing behaviour of the system (which requires transport of a large amount of mass through the simulation box), combined with the high viscosity of the mixture, which sets the relaxation time. During the 600 ns equilibration, quantities like the total energy and the concentrations in the two phases (that is, also the chemical potential) relaxed to their equilibrium values. After the equilibration, we sampled configurations from an additional 30 ns long run, saving them to disk for further analysis.
The algorithm for the phase determination was implemented in python/Cython, 27 only slightly modifying the DBSCAN class of the scikit-learn machine learning package 28 to provide also the sizes of clusters and to take into account the presence of periodic boundary conditions. The latter requirement has been fulfilled by using the periodic boundary condition-aware kd-tree neighbour search implemented in recent versions (Z0.18) of scipy. The configurations from the stored trajectories were made accessible to the python script using the MDAnalysis library. 29,30

Conclusions
We have presented a computational approach that is able to tell apart molecules belonging to different phases in solutions with high mutual solubility. The method is based on the DBSCAN clustering algorithm, and introduces an automatic choice for the threshold concentration, which is one of the two parameters needed by DBSCAN. The approach has been shown to be robust against the choice of the remaining input parameter, namely, the distance cutoff used to calculate the local concentration. We have applied the method to a partially miscible solution of an ionic liquid (BMIM NTf 2 ) with benzene.
This choice has been dictated by two factors, namely (a) the high miscibility of benzene, which yields equilibrium concentrations in a ratio of about 3 : 7 and (b) the virtually zero miscibility of the ionic liquid. On the one hand, the high miscibility of benzene provided a challenging test case, as the equilibrium concentrations are very close to each other, and the interface between the high and low benzene concentration regions is difficult to locate. On the other hand, the low miscibility of the ionic liquid has, as a consequence, the presence of a very well defined interface between the ionic liquid-rich and poor regions. This provides a remarkable benchmark for the identification of the interface, as the phases as defined using the ionic liquid or benzene are not supposed to interpenetrate in a consistent scheme. In fact, only about 1 ion pair is found, on average, to cross the interface and be found in the high benzene concentration region. This shows that the algorithm is in fact able to identify a meaningful interface also when the miscibility of the components is very high. The robustness against the only remaining free parameter of the method, and the quasi linear scaling as a function of the number of atoms in the system, therefore make it a viable approach for the identification of phases in multicomponent systems with high miscibilities. The code is available free of charge as part of the pytim package at https://github.com/Marcello-Sega/pytim.