Equation learning to identify nano-engineered particle-cell interactions: an interpretable machine learning approach

Designing nano-engineered particles capable of the delivery of therapeutic and diagnostic agents to a specific target remains a significant challenge. Understanding how interactions between particles and cells are impacted by the physicochemical properties of the particle will help inform rational design choices. Mathematical and computational techniques allow for details regarding particle-cell interactions to be isolated from the interwoven set of biological, chemical, and physical phenomena involved in the particle delivery process. Here we present a machine learning framework capable of elucidating particle-cell interactions from experimental data. This framework employs a data-driven modelling approach, augmented by established biological knowledge. Crucially, the model of particle-cell interactions learned by the framework can be interpreted and analysed, in contrast to the 'black box' models inherent to other machine learning approaches. We apply the framework to association data for thirty different particle-cell pairs. This library of data contains both adherent and suspension cell lines, as well as a diverse collection of particles. We consider hyperbranched polymer and poly(methacrylic acid) particles, from 6 nm to 1032 nm in diameter, with small molecule, monoclonal antibody, and peptide surface functionalisations. Despite the diverse nature of the experiments, the learned models of particle-cell interactions for each particle-cell pair are remarkably consistent: out of 2048 potential models, only four unique models are learned. The models reveal that nonlinear saturation effects are a key feature governing particle-cell interactions. Further, the framework provides robust estimates of particle performance, which facilitates quantitative evaluation of particle design choices.


Equation learning convergence
We demonstrate that the neural network output matches the test dataset, and that the neural network converges to both the training and validation datasets, in Figure 1 for the synthetic dataset generated with Ω_s(N, t) = d_0 − d_1 N for different numbers of observations and different levels of noise. The equation learning prediction of the test data and the rate of particle-cell association can be found in Figure 2 of the main document.
We again demonstrate that the neural network output matches the test dataset, and that the neural network converges to both the training and validation datasets, in Figure 2 for the synthetic datasets generated with the other choices of Ω_s(N, t) (Figure 2(a)-(b) and Figure 2(c)-(d)).

[Figure 2 caption: (a),(c) Comparison between the neural network prediction of the number of associated particles per cell (orange) and the test dataset (black). (b),(d) Convergence of the neural network to the training dataset (grey) and the validation dataset (purple). The dataset contains 2,500 observations at each of 7 time points with Gaussian noise with a mean of zero and a standard deviation of (a)-(b) 0.5 and (c)-(d) 3.]

These results correspond to the results presented in Figure 3 of the main document, where the equation learning prediction of the test data and association are shown.
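As a minimal sketch of how such a synthetic dataset can be constructed (the coefficient values, noise level, and time points below are illustrative assumptions, not the values used in the study), one can integrate the rate model dN/dt = Ω_s(N, t) = d_0 − d_1 N and corrupt the solution with zero-mean Gaussian noise:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)

# Illustrative coefficients for the linear-saturation rate Omega_s(N) = d0 - d1*N.
d0, d1 = 1.0, 0.1

def rate(t, N):
    # dN/dt = Omega_s(N): association slows as particles accumulate.
    return d0 - d1 * N[0]

# Observe the solution at 7 time points over 24 hours.
t_obs = np.linspace(0.0, 24.0, 7)
sol = solve_ivp(rate, (0.0, 24.0), [0.0], t_eval=t_obs, rtol=1e-8)

# 2,500 noisy observations per time point (noise standard deviation 0.5 here).
n_obs, sigma = 2500, 0.5
data = sol.y[0] + rng.normal(0.0, sigma, size=(n_obs, t_obs.size))
print(data.shape)  # (2500, 7)
```

The noisy `data` array then plays the role of the observations supplied to the equation learning framework.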

Learned model parameters
As discussed in the main document, the learned model can be recast into a form that contains key metrics of particle performance, where α is the particle-cell affinity (m/s), C_0 is the initial particle concentration (m⁻³), S_A is the surface area of the relevant cell type (m²), K is the cell carrying capacity, and δ_i is the relative contribution of the ith component, subject to Σ_i δ_i = 1; the δ_i terms are obtained from the learned d_i coefficients via the relationship given in the main document. The benefit of the recast form is that α and K have been previously identified as relevant metrics of particle performance, and the values of the δ_i allow the relative contribution of the ith-order saturation effect to be measured. We present the learned model parameters for the hyperbranched polymer particles in Table 1, and for the poly(methacrylic acid) particles in Table 2; each table reports the experimental dataset, the learned model, the affinity (α), and the carrying capacity (K). We note that the cell carrying capacity changes with the particle concentration for the hyperbranched polymer particles, which may suggest that the saturation in the number of associated particles per cell is due to a balance between particle-cell association and either particle recycling or particle degradation; however, we do not speculate on the precise biological mechanism here. We observe that saturation effects are critical, as the rate of particle-cell association decreases with the number of particles per cell in each learned model.

Learned model transferability
A key question in machine learning is whether a learned model can be employed to describe data that the model was not trained on. To examine this, we take the most common learned model (Model 2) and determine whether it can be used to describe datasets that have not been used previously in this study: carbon particles [8], amorphous silica particles [10], and hyaluronic acid MPN particles [7]. Full experimental details can be found in the previous literature for the carbon particles study [8], the amorphous silica particles study [10], and the hyaluronic acid MPN particles study [7].
For each dataset we fit Model 2 to the data via nonlinear least squares, and present a comparison between the experimental data and the model prediction in Figure 4. We observe that Model 2 is capable of describing each of the experimental datasets. This suggests that Model 2 is transferable and can be used beyond the datasets on which it was trained. Interestingly, for the datasets in Figures 4(d) and 4(e), Model 1 performs as well as Model 2. This is consistent with our previous observation that Model 1 can be the most appropriate model to describe particle-cell association data.
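The fitting step can be sketched as follows. For illustration only, we use a polynomial saturation rate dN/dt = d_0 − d_1 N − d_2 N² as a stand-in for Model 2 (the precise form of Model 2 is given in the main document), together with hypothetical association data; the nonlinear least squares fit is performed by wrapping the ODE solve inside `scipy.optimize.curve_fit`:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import curve_fit

# Stand-in for Model 2: dN/dt = d0 - d1*N - d2*N^2. Both the form and the
# parameter values below are illustrative assumptions, not the study's values.
def model_prediction(t, d0, d1, d2):
    sol = solve_ivp(lambda s, N: d0 - d1 * N[0] - d2 * N[0] ** 2,
                    (0.0, t.max()), [0.0], t_eval=t, rtol=1e-8)
    return sol.y[0]

# Hypothetical association data: mean particles per cell at each time point.
t_data = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0, 24.0])
N_data = model_prediction(t_data, 1.0, 0.05, 0.002)
N_data = N_data + np.random.default_rng(1).normal(0.0, 0.05, N_data.size)

# Nonlinear least squares with non-negativity bounds on the coefficients.
popt, _ = curve_fit(model_prediction, t_data, N_data,
                    p0=[0.5, 0.01, 0.001], bounds=(0.0, np.inf))
```

The fitted `popt` coefficients can then be recast into the affinity and carrying-capacity metrics discussed above.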

Two-stage binding model
The processes by which nano-engineered particles are internalised are complicated, such that a single ordinary differential equation model of particle internalisation is unlikely to capture all of the relevant details. For example, numerous authors have presented models that represent the internalisation process as a two-stage process, where a particle first binds to the surface of a cell (via a receptor or a non-specific membrane binding) and is subsequently internalised [1,2,3,9]. These models can be represented mathematically as

dB(t)/dt = Ω_B(B(t), I(t)) − Ω_I(B(t), I(t)),
dI(t)/dt = Ω_I(B(t), I(t)),

where B(t) is the average number of bound particles at time t, I(t) is the average number of internalised particles at time t, Ω_B(B(t), I(t)) is a function that describes the rate of particles binding to the cell, and Ω_I(B(t), I(t)) is a function that describes the rate of bound particles becoming internalised. Under appropriate choices of Ω_B(B(t), I(t)) and Ω_I(B(t), I(t)), it is straightforward to solve these equations to obtain predictions of B(t) and I(t). However, identifying Ω_B(B(t), I(t)) and Ω_I(B(t), I(t)) requires that bound and internalised particles can be distinguished. As various authors have discussed, this is not standard practice, and there are many issues that can be introduced via the choice of experimental techniques employed to distinguish between bound and internalised particles [5,6].

[Figure 4 caption: Comparison between experimental data and predictions obtained from fitting Model 2 to the experimental data for (a) carbon particles at 10 µM concentration and MCF7 cells [8]; (b) carbon particles at 25 µM concentration and MCF7 cells [8]; (c) 20 nm amorphous silica particles and MGC80-3 cells [10]; (d) 20 nm amorphous silica particles and HeLa cells [10]; (e) hyaluronic acid MPN particles and MDA-MB-231 cells [7]; and (f) hyaluronic acid MPN particles and BT-474 cells [7]. Blue lines correspond to model predictions and black dots correspond to the mean of the experimental data. Where available, error bars represent one standard deviation of the mean.]
Nonetheless, the difficulty of obtaining experimental data does not preclude examining whether the equation learning framework is capable of identifying the correct model. To achieve this, we can generate synthetic data from the above model, similar to the approach employed in Section 3.1 of the main manuscript. We can use this data to determine whether the equation learning framework can identify the forms of Ω_B(B(t), I(t)) and Ω_I(B(t), I(t)) when provided with data generated according to those choices. We note that this approach will, in general, not work if the only data available is the number of associated particles (i.e. N(t) = B(t) + I(t)). This can be seen through

dN(t)/dt = dB(t)/dt + dI(t)/dt = Ω_B(B(t), I(t)),

which is independent of the function representing internalisation, Ω_I(B(t), I(t)). Moreover, Ω_B(B(t), I(t)) may depend on both the number of bound and internalised particles, rather than the sum of the two, and this information is not present in the data.
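A short simulation of the two-stage system illustrates the cancellation: the internalisation terms appear with opposite signs in dB/dt and dI/dt, so they vanish from dN/dt. The coefficient values below are illustrative assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Two-stage binding system with illustrative (assumed) coefficients:
#   dB/dt = Omega_B(B, I) - Omega_I(B, I),   dI/dt = Omega_I(B, I),
# so N = B + I obeys dN/dt = Omega_B(B, I): association data alone carries
# no information about Omega_I.
d0, d1, d2, d3, d4 = 1.0, 0.05, 0.002, 0.1, 0.001

def omega_B(B, I):
    return d0 - d1 * B - d2 * B ** 2  # binding rate (saturates in B)

def omega_I(B, I):
    return d3 * B + d4 * B ** 2       # internalisation rate

def rhs(t, y):
    B, I = y
    return [omega_B(B, I) - omega_I(B, I), omega_I(B, I)]

t_obs = np.arange(0.0, 25.0, 1.0)  # hourly observations over 24 hours
sol = solve_ivp(rhs, (0.0, 24.0), [0.0, 0.0], t_eval=t_obs, rtol=1e-9)
B, I = sol.y
```

Summing the two components of `rhs` recovers exactly `omega_B`, which is the cancellation underlying the identifiability argument above.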
We modify the equation learning framework such that it constructs m = (n + 1)² model components Ω = {ω_1(B_p, I_p, t), . . . , ω_m(B_p, I_p, t)}. That is, the model components depend on the observations of both the number of bound and internalised particles. We choose model components that are polynomial terms of the form B_p(t)^a I_p(t)^b for 0 ≤ a ≤ n and 0 ≤ b ≤ n. We employ the following assumptions for the two neural networks:
• The numbers of both bound and internalised particles are non-negative.
• The rates of particle binding and particle internalisation are non-negative.
We do not include any assumptions about the presence of saturation effects, unlike for the single species model, as it is feasible that the rate of particle internalisation may increase with the number of bound particles. It is straightforward to constrain the neural network to produce such output. We employ a single hidden layer with a sigmoid activation function and non-negative biases and weights, alongside a linear output layer with a non-negative bias and weight [4].
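The library of polynomial model components described above can be sketched as follows (time-dependent components are omitted here for simplicity):

```python
import numpy as np

def build_library(B, I, n):
    """Polynomial model components B^a * I^b for 0 <= a, b <= n.

    Returns an array of shape (len(B), (n + 1) ** 2), one column per
    component, matching the m = (n + 1)^2 terms in the modified framework.
    """
    cols = [B ** a * I ** b for a in range(n + 1) for b in range(n + 1)]
    return np.column_stack(cols)

# Example: n = 2 gives m = 9 components {1, I, I^2, B, BI, BI^2, B^2, ...}.
B = np.array([0.0, 1.0, 2.0])
I = np.array([0.0, 0.5, 1.0])
theta = build_library(B, I, 2)
print(theta.shape)  # (3, 9)
```

The learned rates Ω_B and Ω_I are then linear combinations of these columns, with the selection of active terms handled by the equation learning procedure.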

Results
We generate data from the two-stage binding model by first selecting Ω_B^s(B(t), I(t)) = d_0 − d_1 B(t) − d_2 B²(t) and Ω_I^s(B(t), I(t)) = d_3 B(t) + d_4 B²(t). This represents an internalisation process where particle binding can saturate but particle internalisation does not saturate [1,2,3,9]. As in Section 2.1 of the main manuscript, we generate synthetic data at time points corresponding to every hour over a total of 24 hours. We apply the modified equation learning framework to the synthetically-generated data and present the results in Figure 5. We first see that the learned model describes the test data accurately for both the bound and internalised particles. Crucially, we observe that the model learned from the data is of the correct form, that is, Ω_B^L(B(t), I(t)) = d_0 − d_1 B(t) − d_2 B²(t) and Ω_I^L(B(t), I(t)) = d_3 B(t) + d_4 B²(t).
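The framework itself approximates the data with constrained neural networks before identifying the model; a much simpler sketch of the identification step, using finite differences and ordinary least squares against the polynomial library (with illustrative coefficients and noise-free data), is:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Generate noise-free data from the two-stage model with illustrative
# (assumed) coefficients, then recover the internalisation coefficients by
# regressing dI/dt onto polynomial library columns.
d = [1.0, 0.05, 0.002, 0.1, 0.001]

def rhs(t, y):
    B, I = y
    omega_B = d[0] - d[1] * B - d[2] * B ** 2
    omega_I = d[3] * B + d[4] * B ** 2
    return [omega_B - omega_I, omega_I]

t = np.linspace(0.0, 24.0, 241)
sol = solve_ivp(rhs, (0.0, 24.0), [0.0, 0.0], t_eval=t, rtol=1e-10, atol=1e-12)
B, I = sol.y

# Library columns 1, B, B^2 suffice for this illustrative model.
theta = np.column_stack([np.ones_like(B), B, B ** 2])
dIdt = np.gradient(I, t)
coef, *_ = np.linalg.lstsq(theta, dIdt, rcond=None)
# coef is approximately [0, d3, d4]: the generating coefficients are recovered.
```

In the framework proper, the neural network surrogate replaces the finite-difference derivative estimate, and sparsity is enforced when selecting active library terms.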
We next generate data from the two-stage binding model according to Ω_B^s(B(t), I(t)) = d_0 − d_1 B(t) + d_2 B(t)I(t) and Ω_I^s(B(t), I(t)) = d_3 B(t) − d_4 B(t)I(t). This represents an internalisation process where particle internalisation can saturate but particle binding does not saturate. We present the results obtained from the modified equation learning framework when applied to this synthetic data. Again, we observe a close match between the test data and the learned model. Further, the learned model is of the form Ω_B^L(B(t), I(t)) = d_0 − d_1 B(t) + d_2 B(t)I(t) and Ω_I^L(B(t), I(t)) = d_3 B(t) − d_4 B(t)I(t), and hence matches the model used to generate the data. This indicates that the modified form of the equation learning framework is able to identify the correct model when data on both internalised and bound particles are available. Importantly, we also see that the equation learning framework can identify models when the constraints on the neural network are relaxed, as we observe that the first derivative of the number of bound particles is non-monotonic (i.e. it both increases and decreases) in Figure 5(c).