Nicole
Vijgen
,
Karsten M.
Poulsen†
,
Gustavo Sosa
Macias‡
and
Christine K.
Payne
*
Thomas Lord Department of Mechanical Engineering and Materials Science, Duke University, Durham, North Carolina 27708, USA. E-mail: christine.payne@duke.edu
First published on 29th July 2025
Nanoparticles (NPs) present in any biological environment form a “corona” of proteins on the NP surface. This protein corona, rather than the bare NP, determines the biological response to the protein–NP complex. Experiments, especially proteomics, can provide an inventory of proteins in the corona, but researchers currently lack a method to predict which proteins will interact with NPs. The ability to predict the protein corona would aid the design of NPs by decreasing the time and cost of experiments. We describe the development and use of random forest regression and classification models to predict protein abundance and enrichment, respectively, on the surface of NPs using a dataset of NP, protein, and experimental features. These models were trained using data generated in-house through the synthesis and functionalization of NPs with varied core material, surface ligand, diameter, and zeta potential. NPs were incubated with fetal bovine serum, a common protein source for cultured cells, to form a corona, which was characterized by proteomics. Both models identified protein abundance in the serum used to form the corona as the most significant predictor of corona proteins. NP zeta potential and hydrodynamic diameter emerged as the most important NP factors. The random forest regression model was used to test the ability to predict the protein corona of NPs that were excluded from the training data. We highlight the best and worst predictions. These findings offer a machine learning approach to guide experiments.
The ability to predict the protein corona would fill an important gap in the design and use of new nanomaterials. Much previous work,1–5,10,19–23,25,26,28,29,31,37 including our own,6,24,27,30,32–36 has shown that the protein corona, rather than the bare NP, determines how cells bind, internalize, and respond to the protein–NP complexes. For example, previous experiments using 105 gold NPs with three different gold cores (15 nm, 30 nm, 60 nm) and 67 different ligands (small molecules, polymers, peptides, surfactants) showed that the protein corona was a predictor of cellular association and pointed towards the importance of hyaluronan-binding proteins in the corona.31 Predicting the protein corona would provide the first step in predicting the cellular response.
Recent work has brought the tools of machine learning (ML) to the challenge of protein corona prediction. For example, previous work has predicted the protein corona formed on single-walled carbon nanotubes using a random forest classifier (RFC).39 The RFC model was trained on a dataset of human cerebrospinal fluid and blood plasma proteins, characterized by proteomics and physicochemical features derived from protein sequences in UniProt. The model was then used to predict the adsorption of human cerebrospinal fluid and blood plasma proteins on the same single-walled carbon nanotubes. This was a significant result showing that the RFC model was effective in predicting which proteins would adsorb on single-walled carbon nanotubes. The model also identified key protein features that were associated with a higher binding affinity for single-walled carbon nanotubes. These protein features included a high content of solvent-exposed glycines and a high percentage of non-structure associated amino acids, those amino acids not associated with helices, sheets, or turns. The model was also generalizable to other nanomaterials, such as polystyrene NPs (pNPs). Another RFC model was used to predict the protein corona formed from human serum proteins adsorbed on nanostructures formed from DNA.40 A separate study using a RFC model examined yeast protein enrichment on silver NPs (10 nm and 100 nm) as a function of protein, NP, and solvent features using a previously published database of yeast protein enrichment on silver NPs.41 This model illustrated the ability to utilize existing proteomic data to train new ML models, indicating the potential power of data sharing and data scraping. As data sharing becomes increasingly common, researchers have access to large quantities of proteomics data from other research groups, enabling the generation of large datasets that capture multiple protein and NP features for use in model development. This use of external data was again demonstrated using a random forest regression (RFR) model with 652 different NPs (silver, gold, iron oxide, titanium dioxide, silicon dioxide, liposomes, and polystyrene) with coronas formed from human serum, bovine serum, or human plasma.42 The scraped and analyzed data featured a combination of qualitative factors such as NP type and shape, surface modifications, and dispersion mediums, and quantitative factors such as NP diameter and zeta potential. The RFR model was successful in predicting the composition of the protein corona across NPs and protein sources, suggesting the viability of scraped data to train regression models.
We describe the development and use of two different ML models, RFR and RFC, to predict protein corona composition based on a combination of NP features, protein features, and experimental features. In comparison, previous research using ML to predict the protein corona focused on protein features or a combination of protein features and NP features with proteins as the dominant factor.39,41 Our models provide a new focus on NP features and experimental features. The RFR model predicts individual protein abundances in the corona as a quantitative, continuous value. The RFC model predicts whether a specific protein is enriched or depleted relative to its abundance in the serum, a categorical value. In addition to building on previous protein corona models,39–42 random forest models were selected as they provide a connection between model outputs and physical interpretation of the results.
We developed the RFR and RFC models in parallel to compare their performance. To ensure uniform data handling, all of the data used in our models was generated in-house using a semi-automated workflow of corona formation, purification, and characterization using a liquid-handling robot along with a low-cost proteomics protocol to characterize these samples.43,44 To train and test our ML models, we generated 11 NPs with varying features (core material, surface ligand, diameter, and zeta potential) to probe the relationship between NP properties and the protein corona. Proteomics was used to characterize protein coronas on the 11 NPs following incubation with fetal bovine serum (FBS; 10% and 100%). We selected FBS as the protein source for our protein–NP samples as it is a common nutrient source for cells in culture, well-documented in protein sequence databases, and frequently used in other protein corona studies.45–48 Protein features (e.g. secondary structure, percentage of polar amino acids) were derived from protein sequence data. Experimental features included NP and protein incubation concentrations and separation method. This combination of NP, protein, and experimental features generated an input dataset of 84 total features that was used to train and validate predictions for RFR and RFC models.
We find that both the RFR and RFC models had high performance metrics. The RFR model identified 61 significant features and the RFC model identified 16 significant features for corona prediction. Thirteen features were shared between the two models, with protein abundance in FBS selected as the most significant feature. NP zeta potential and hydrodynamic diameter were the next most important features. We then tested the ability of these models to predict the protein corona of previously unseen NPs using the RFR model. The resulting R2 values for corona prediction ranged from 0.45–0.88.
These results suggest that both regression and classification models can serve as computational tools to predict protein–NP interactions. The ability to predict a protein corona based on NP features, which are relatively inexpensive to determine compared to full biological experiments, and protein features, which are tabulated in existing databases, would reduce the cost and the time of current experimental methods. Previous work has shown that corona formation can lead to mis-targeting of NPs,49 masking of targeting ligands,50 and altered biodistribution of NPs.51 In comparison to these negative outcomes, NPs can also be designed to select for specific corona proteins with beneficial properties such as targeted drug delivery to specific organs.3,23,52–54 In the long term, we hope that a detailed NP characterization will allow for the prediction of, for example, the toxicity of new nanomaterials with fewer cell and animal experiments. This would reduce costs and increase throughput in the development and use of new nanomaterials.
NP | Core | Ligand | d TEM (nm) | d h (nm) | PDI | ZP (mV) |
---|---|---|---|---|---|---|
Citrate-mNPS | mNP | Citrate | 82 ± 36 | 149 ± 3 | 0.11 ± 0.01 | −42 ± 6 |
Citrate-mNPL | mNP | Citrate | 182 ± 48 | 229 ± 11 | 0.19 ± 0.02 | −49 ± 4 |
PEI-mNPS | mNP | Polyethyleneimine | 82 ± 36 | 226 ± 62 | 0.22 ± 0.08 | 29 ± 4 |
PEI-mNPL | mNP | Polyethyleneimine | 182 ± 48 | 282 ± 78 | 0.25 ± 0.09 | 39 ± 4 |
PVP-Au-mNPS | Gold-mNP | Polyvinylpyrrolidone | 98 ± 60 | 271 ± 17 | 0.31 ± 0.04 | −12 ± 4 |
PVP-Au-mNPL | Gold-mNP | Polyvinylpyrrolidone | 244 ± 62 | 316 ± 85 | 0.22 ± 0.04 | −11 ± 3 |
PEI-Au-mNPS | Gold-mNP | Polyethyleneimine | 98 ± 60 | 229 ± 17 | 0.19 ± 0.03 | 12 ± 3 |
PEI-Au-mNPL | Gold-mNP | Polyethyleneimine | 244 ± 53 | 291 ± 9 | 0.15 ± 0.04 | 12 ± 1 |
PEG-Au-mNPL | Gold-mNP | Polyethylene glycol (5k) | 244 ± 53 | 610 ± 90 | 0.38 ± 0.04 | −3 ± 3 |
COOH-pNP | Polystyrene | Carboxylate | 200 ± 10 | 221 ± 2 | 0.02 ± 0.01 | −63 ± 9 |
PEG-pNP | Polystyrene | Polyethylene glycol (2k) | 200 ± 23 | 266 ± 7 | 0.13 ± 0.06 | −7 ± 3 |
The mNPs were functionalized using previously published protocols.56 To achieve an adsorbed coating of PEI, dry iron mNPs (5 mg mL−1) were added to polyethyleneimine (PEI; 0.1 mM (aq, #408727, Sigma-Aldrich)). The mixture was shaken (1 h) on a rotary shaker at RT. The mixture was washed three times with water using a magnet to separate the iron mNPs. The PEI coating was verified by a positive zeta potential.
Gold nanoseeds were synthesized following a previously published protocol.56,57 In brief, 44 mL of DI water, 3 mL of 100 mM sodium hydroxide (aq, #S8045, Sigma-Aldrich), and 1 mL of 50 mM tetrakis-(hydroxymethyl) phosphonium chloride (#404861, Sigma-Aldrich) were mixed in an Erlenmeyer flask (100 mL). After mixing for 5 minutes with a magnetic stir bar, 1.5 mL of 25 mM gold(III) chloride trihydrate (#520918, Sigma-Aldrich) was added. After the addition of the gold salt, the solution turned a deep red, signifying the formation of gold nanoseeds.
The PEI-mNPs were coated with gold nanoseeds by adding the PEI-mNPs (1 mg mL−1) to a solution of gold nanoseeds (50 mL, 10 nM). The gold nanoseeds were grown into a gold shell to get a more complete surface coating. The growth of the gold shell was stabilized by adding NPs functionalized with gold nanoseeds (25 μg mL−1) to polyvinylpyrrolidone (PVP; 9.85 mg mL−1, #PVP40, Sigma-Aldrich). After vortexing, hydroxylamine (75 μg mL−1, #159417, Sigma-Aldrich) and gold(III) chloride trihydrate (75 μg mL−1, #520918, Sigma-Aldrich) were successively added. The color of the solution took on a bluish-purple tint within minutes of adding the gold mixture. The resulting PVP-Au-mNPs were separated using a magnet, washed three times with DI water, and resuspended in DI water. The gold shell growth was confirmed using transmission electron microscopy (TEM), as described in NP characterization.
To obtain NPs with a positive zeta potential and vary the ligand of the NPs, PVP was exchanged for PEI by shaking the PVP-Au-mNPs (1 mg mL−1) in PEI 0.1 mM (aq) for 1 hour. Following ligand exchange, the NPs were removed from suspension using a magnet, washed four times with DI water, and resuspended in DI water. Ligand exchange was confirmed by the zeta potential of the resulting NPs.
To obtain a near-neutral surface charge and vary the ligand of the NPs, PVP was displaced with thiolated polyethylene glycol (PEG) by shaking the PVP-Au-mNPs in a thiol PEG solution (10 mM, A3029-1/M-SH-5000, JenKem Technology, Plano, TX) for 1 hour. Following ligand exchange, the NPs were removed from suspension using a magnet, washed four times with DI water, and resuspended in DI water. Ligand exchange was confirmed by the zeta potential of the resulting NPs.
Hydrodynamic diameter, polydispersity index, and zeta potential of the NPs (10–100 μg mL−1 in PBS diluted 1:
100 in DI water) were measured using DLS (Zetasizer, Malvern Instruments, Worcestershire, England). Measurements were carried out with three distinct samples. Each measurement was performed for 12–30 runs. The average and standard deviation are reported for all measurements. Electrophoretic mobility was converted to zeta potential using the Smoluchowski approximation.
Protein concentration was measured with the Pierce 660 nm Protein Assay Reagent (referred to as a 660 nm assay; #2260, Thermo Fisher Scientific) with the addition of Ionic Detergent Compatibility Reagent (#22663, Thermo Fisher Scientific) according to the manufacturer's instructions. The concentration of protein present in the hard corona was determined by removing the proteins from the NPs by incubating with sodium dodecyl sulfate (SDS) buffer (5% w/v, #L3771, Sigma-Aldrich) for 30 minutes at RT. Protein concentration was then determined by measuring absorbance at 660 nm using a plate reader (SpectraMax iD3, Molecular Devices, San Jose, CA). A residual amount of protein is resistant to SDS removal independent of the duration of SDS incubation, as shown previously.43
Proteomic analysis was carried out in the Proteomics and Metabolomics Core Facility, part of the Duke Center for Genomics and Computational Biology, as described previously.43,44 In brief, digested samples were analyzed using LC-MS/MS with ≤ 25 mg of digested protein injected. MicroFlow LC was performed with an ultra-performance liquid chromatography (UPLC, 1 mm × 100 mm, M-Class, Waters Corporation; 80 μL min−1) column and a 17 minutes total elution time. The column was run with an acetonitrile gradient (5–40%) with 0.1% formic acid. Peptide fragments were analyzed using in-line tandem mass spectrometry (Orbitrap Fusion Lumos, Thermo Fisher).
We analyzed the LC-MS/MS data using MaxQuant (v2.5.2.0, Max Planck Institute, Munich, Germany), an open-source software designed to qualitatively and quantitatively analyze mass spectrometry data.59,60 The raw LC-MS/MS spectra were searched, using their integrated Andromeda search engine, against the Swiss-Prot Bovine (6046 proteins) canonical protein knowledge base from UniProt, accessed on May 22nd, 2024.61 A custom contaminants file was used, which contained a relevant subset of the Common Repository of Adventitious Proteins (cRAP) database.62 For protein and peptide quantification and identification, default MaxQuant parameters were used: a 0.01 false discovery rate, a minimum peptide length of 7 amino acids, a maximum peptide length of 25 amino acids, oxidation, acetyl groups as variable modifications, and carbamidomethyl as a fixed modification. The Intensity method in MaxQuant was used for abundance quantification calculations.
The resulting proteomic data was analyzed and filtered in Perseus (v2.0.11, Max Planck Institute). Proteins were excluded if considered contaminants, quality control standards, or only identified by site. Data normalization was performed in Python (v3.11, Python Software Foundation, Beaverton, OR). A quantitative internal standard was not used for these experiments. To correct for any change in performance or differences in protein loading, each sample was normalized to itself by dividing by the mean of the interior 80% of the protein intensities.63 Each sample was scaled to have the same average. We report these values as percent normalized abundance. Fold change for each protein was calculated by taking the log base 2 of the normalized corona abundance divided by serum abundance. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the Proteomics Identification Database (PRIDE) partner repository with the dataset identifier PXD053700 and 10.6019/PXD053700.64 In total, 92 proteins were identified in the sample of FBS used to form the coronas. This is likely due to the overwhelming signal from albumin in the FBS, which limits the detection of lower abundance proteins. In comparison, 369 proteins were identified in the corona samples including the 92 also identified in FBS. Formation of a corona serves as an enrichment step allowing the identification of more unique proteins than in FBS alone.
After data pre-processing, recursive feature elimination with k-fold cross-validation (ten folds) was used to identify the most important predictive features on the training data split (90%) (Table S4). This feature selection process was iterative, with a step size of one, and maintained a minimum of one feature until completion. A custom scoring method was used, which combined mean squared error with a penalty for feature variation and quantity to refine feature selection effectively. This custom approach also integrated evaluation criteria (R2, mean squared error, Pearson correlation coefficient, and Spearman's rank correlation coefficient) across multiple data folds, providing a comprehensive assessment of model performance on the selected subset of features across ten folds.
The selected features were then used to predict on the test data split (10%), specifically predicting the log2-transformed abundance values. The performance of the model was evaluated using the custom scoring method which assessed the accuracy of the predictions with respect to the known data values in the test split, hidden during model training.
The RFC model was trained on protein enrichment data, which was calculated based on the log2-transformed ratio of corona abundance to serum abundance. Handling of protein data was identical to that described for the RFR model. Enrichment values were classified with their respective binary categorization for future RFC classification (i.e. thresh parameter set to zero).
Recursive feature elimination with k-fold cross-validation (ten folds) was implemented to identify the most important features for corona prediction (Table S5). Similar to RFR, a step size of one was used to iteratively eliminate features until the optimal number of features was determined. This feature selection process was evaluated with standard classifier model evaluation metrics; area under the receiver operatoring characteristic curve (AUROC; measure of the ability of the model to distinguish between classes, considering both sensitivity and specificity), accuracy, precision (positive predictive value), F1 score (harmonic mean of precision and recall), and recall (true positive rate). The optimal features identified were subsequently used to train the classification model and predict on a train-test split of 90/10. Predictions were assessed using the same evaluation metrics used to assess feature selection.
Protein | FBS (rank) | Citrate-mNPL | Citrate-mNPS | PEI-mNPL | PEI-mNPS | PEI-Au-mNPL | PEI-Au-mNPS | PVP-Au-mNPL | PVP-Au-mNPS | COOH-pNP | PEG-pNP | PEG-Au-mNPL |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Albumin | 54.3 (1) | 33.7 | 27.3 | 42.7 | 43.5 | 36.9 | 23.4 | 41.4 | 32.5 | 19.9 | 16.8 | 46.0 |
Alpha-2-HS-glycoprotein | 15.7 (2) | 12.6 | 12.5 | 13.8 | 14.1 | 13.2 | 8.3 | 12.1 | 10.2 | 7.6 | 4.0 | 14.4 |
Alpha-1-antiproteinase | 7.55 (4) | 5.5 | 3.1 | 6.2 | 5.9 | 3.8 | 2.2 | 6.0 | 4.3 | 3.0 | 4.1 | 5.7 |
Alpha-2-macroglobulin | 2.47 (5) | 1.8 | 1.7 | 3.0 | 2.5 | 1.9 | 2.3 | 2.8 | 2.4 | 0.96 | 1.1 | 1.9 |
Apolipoprotein A-I | 1.20 (7) | 1.3 | 0.8 | 1.7 | 1.6 | 0.9 | 1.0 | 1.5 | 1.8 | 4.9 | 13.7 | 1.3 |
Complement C3 | 0.78 (9) | 3.9 | 9.1 | 1.0 | 1.9 | 0.5 | 0.5 | 1.2 | 1.1 | 0.5 | 0.3 | 0.8 |
Hemoglobin subunit alpha | 0.51 (14) | 1.3 | 1.5 | 2.0 | 1.2 | 1.0 | 1.8 | 3.4 | 6.8 | 13.8 | 40.0 | 2.9 |
Inter-alpha-trypsin inhibitor heavy chain H3 | 0.15 (26) | 0.1 | 0.2 | 2.8 | 2.3 | 7.5 | 13.2 | 0.2 | 0.1 | 0.1 | 0.8 | 0.1 |
Prothrombin | 0.09 (29) | 1.8 | 2.0 | 2.3 | 2.3 | 5.2 | 7.7 | 0.8 | 1.0 | 6.4 | 0.7 | 0.8 |
Apolipoprotein E | 0.02 (45) | 3.1 | 2.1 | 0.2 | 1.9 | 0.15 | 0.2 | 5.7 | 8.6 | 2.0 | 4.7 | 7.2 |
While all NPs show a similar interaction with albumin (depletion), the other most abundant proteins show a wider range of protein–NP interactions (Fig. 1). For example, complement C3, which plays a central role in the activation of the complement system,73,74 exhibits a notably high spread (−2 to +4.5 log2 fold change) in enrichment and depletion across NPs, in comparison to the more narrow spread for albumin (−2 to −0.25 log2 fold change). The wide range in enrichment and depletion values for complement C3 suggests that NP features such as diameter, zeta potential, and surface functionalization significantly impact the adsorption of complement C3 onto the NP surface. In contrast, proteins such as albumin (−2 to −0.25 log2 fold change) and alpha-2-HS-glycoprotein (−2 to +1 log2 fold change) show narrower log2 fold changes across different NPs, suggesting less dependence on NP features. Similarly, apolipoprotein A-I and alpha-2-macroglobulin exhibit a narrow spread of log2 fold changes (−0.5 to +3.5 and −1.5 to +1, respectively), suggesting stable interactions across different NP features. Complement C3 tends to be less enriched with positively charged NPs (e.g., PEI-mNPS, PEI-mNPL), though high variability makes it difficult to establish a clear correlation with zeta potential. Apolipoprotein A-I also shows moderate to high enrichment with negatively charged NPs and consistent or slight depletion with positively charged NPs. While the data does not consistently support the idea that decreasing NP zeta potential reduces protein enrichment, highly negative zeta potentials (e.g., COOH-pNP with −63 mV) may decrease complement C3 enrichment.
In addition to NP and protein features, previous work has shown that the separation method used in the preparation of the protein corona, magnetic pull-down or centrifugation, is one factor in the composition of the protein corona.56 We note increased enrichment of hemoglobin subunit-alpha and apolipoprotein A-I on PEG-pNP and COOH-pNP, which underwent centrifugation, in comparison to the mNPs, which are separated using magnetic pull-down (Fig. 1).
Overall, this level of NP, protein, and experimental feature complexity makes it challenging to extract trends. Instead, we use the three NP core materials (mNP, gold-mNP, and polystyrene), two core diameters (82 nm and 182 nm), six surface ligands, and seven effective surface charges as training data for ML.
Random Forest models were selected for their ability to reduce overfitting, due to their ensemble nature, and strengths for this specific application.75–77 For example, random forest models average results over multiple trees, which leads to higher accuracy than support vector machines.78 Neural networks require training data on the scale of ∼10000 or greater data points, making them less useful for this type of data-limited application.79 Implementation in Python allows for code sharing and use by others. RFR was employed to predict continuous numerical values, specifically the abundance of individual proteins in the corona, which we quantified as peptide intensities. RFC was utilized to predict categorical outcomes, determining whether a protein would be enriched or depleted in the protein corona relative to the serum used to form the corona. Both models were trained to determine the optimal number of input features using recursive feature elimination with k-fold cross-validation across ten folds. Model performance was evaluated using multiple performance metrics.
For RFR, R2, mean squared error, Pearson correlation coefficient, and Spearman's rank correlation coefficients were used to assess model performance. R2 provides a measure of how well true dataset values are replicated by the model, on a scale of 0 to 1, with 1 being a perfect replication by the model. An R2 of 0 would indicate that the predictions are random. Mean squared error measures the difference between the predicted values made by the model and true values in the dataset, with values closer to zero indicating better model performance. The Pearson correlation coefficient measures the correlation between the predicted and observed corona protein abundance values. Spearman's rank correlation coefficient measures the rank-based correlation between predicted and observed corona protein abundance values. Both Pearson correlation and Spearman's correlation coefficients can have values ranging from −1 to 1. A value of +1 indicates perfect positive correlation. For RFC, area under the receiver operating characteristic curve (AUROC), accuracy, precision, F1 score, and recall were used as performance metrics. AUROC measures the ability of the RFC model to distinguish between classes, defined as protein enrichment and depletion, considering the true positive rate and false positive rate across all classification thresholds (i.e. enriched or depleted). An AUROC value of 0.5 corresponds to a random guess and 1.0 represents perfect classification. Accuracy describes the ratio of correctly predicted proteins to total number of proteins. Precision describes the proportion of true positives to predicted positives. Recall describes the proportion of true positives to actual positives (both true positives and false negatives). Precision and recall are combined in the F1 score, which is the harmonic mean of the two values, meaning both precision and recall are equally weighted, which can range from 0 to 1 (perfect precision and recall).
We first used recursive feature elimination with k-fold cross-validation to identify and score the most important NP, protein, and experimental features for the protein corona. Recursive feature elimination with k-fold cross-validation is a feature selection technique that recursively removes less important features while evaluating the performance of the model through cross-validation. This technique identifies the optimal subset of features that balance model accuracy and complexity. The output is the list of selected features and their associated percentage of importance. Results for the RFR model show the negative mean squared error as a function of the number of features (Fig. S1). The error decreases sharply initially and stabilizes at 61 features, indicating that 61 features are optimal for our model (Table 3). The average and standard deviation of the R2, mean squared error, Pearson correlation coefficient, and Spearman's rank correlation further validate robustness (Table S7). High scores across these metrics reinforce the utility of the RFR model and demonstrate the efficacy of the selected features (Table S4).
Feature | Importance (%) |
---|---|
Protein abundance in FBS | 46.7 |
Zeta potential | 4.1 |
d h | 3.2 |
Protein incubation concentration (10% or 100%) | 2.1 |
d TEM | 1.8 |
% Phenylalanine | 1.7 |
% Non-structure associated amino acids | 1.4 |
% Asparagine | 1.3 |
NP incubation concentration | 1.2 |
% Alanine | 1.2 |
Following feature selection using recursive feature elimination with k-fold cross-validation, the RFR model demonstrated robust predictive performance, as reflected by high evaluation metrics for predictions made on the test data split (Fig. S2 and Table 4). A comprehensive analysis of RFR model and prediction performance is provided in SI (Table S7).
Evaluation metric | Score |
---|---|
R 2 | 0.81 |
Mean squared error | 2.02 |
Pearson | 0.90 |
Spearman | 0.87 |
For both models, the abundance of individual proteins in FBS was identified as the most important feature for predicting the protein corona (Table 3 and S5). This is intuitive, as proteins with higher relative abundance are more likely to interact with NPs. However, protein abundance is not the single determinant of protein adsorption on NPs, reflecting the distinction between kinetic and thermodynamic control in protein corona formation. For instance, despite being the most abundant protein in FBS (54.3%), albumin is depleted in the corona of all NPs (Tables 2, S6 and Fig. 1). This is in agreement with previous research showing that initial kinetic adsorption of proteins can be displaced by proteins with greater thermodynamic stability on the NP surface.22,29,37,80–85 This indicates that while protein abundance is a key factor, NP and protein features also influence protein adsorption. Both models identified zeta potential as the next most significant feature after relative protein abundance values, with importance values of 4.1% and 8.5% for the RFR and RFC models, respectively (Table 3 and S5). The hydrodynamic diameter of the functionalized NPs (dh) was the third most significant feature, with an importance of 3.2% and 7.6% for RFR and RFC, respectively (Table 3 and S5).
NP | FBS incubation (%) | R 2 |
---|---|---|
Citrate-mNPS | 100 | 0.88 |
Citrate-mNPS | 10 | 0.88 |
Citrate-mNPL | 10 | 0.87 |
PVP-Au-mNPL | 100 | 0.86 |
PVP-Au-mNPL | 10 | 0.85 |
PEI-mNPS | 10 | 0.85 |
PVP-Au-mNPS | 10 | 0.84 |
PEI-mNPS | 100 | 0.84 |
Citrate-mNPL | 100 | 0.83 |
PVP-Au-mNPS | 100 | 0.79 |
PEI-mNPL | 100 | 0.78 |
PEI-mNPL | 10 | 0.78 |
COOH-pNP | 10 | 0.77 |
PEG-Au-mNPL | 100 | 0.76 |
COOH-pNP | 100 | 0.71 |
PEI-Au-mNPL | 100 | 0.69 |
PEI-Au-mNPS | 100 | 0.63 |
PEG-pNP | 100 | 0.45 |
To determine if higher abundance corona proteins, such as albumin (Table 2), were more likely to be predicted correctly, we measured the correlation of the top ten most abundant corona proteins with the accuracy of prediction. We found no correlation between protein corona abundance and predictive ability (Fig. S6).
Both RFR and RFC models showed excellent performance metrics (Tables 4, S7–S9 and Fig. S1–S5). Our models identified the top feature in predicting the protein corona composition to be the protein abundance in FBS (46.7% and 27.7% for RFR and RFC, respectively) (Table 3 and S5). The next most important feature for both models was zeta potential (4.1% and 8.5% for RFR and RFC, respectively). The hydrodynamic diameter of the functionalized NP was the third most significant feature, with an importance of 3.2% for RFR and 7.6% for RFC. To evaluate RFR model predictions for new NPs, we implemented leave-one-group-out cross-validation, where each NP was iteratively excluded from training and then used as the test set. This approach allowed us to assess the predictive ability for each NP individually, identifying which were the best and worst performing (Fig. 2 and Tables 5 and S10). The model had the highest accuracy in predicting protein abundances for citrate-mNPS (Fig. 2 and Tables 5 and S10) and the lowest accuracy in predicting protein abundances for PEG-pNP (Fig. 2 and Tables 5 and S10). Future work will explore the specific protein–NP interactions that underlie these predictions.
Our work builds on previous ML models with the common goal of protein corona prediction.39–42 In comparison to the previous model developed for single-walled carbon nanotubes,39 our models incorporate NP and experimental features, in addition to protein features, into the training set. Our models also incorporated the abundance of individual proteins in the FBS used to form the corona, which was ultimately the most important feature identified by both models (Tables 3, S4 and S5). In comparison to the model developed with silver NPs and yeast proteins,41 FBS provides a more relevant protein source, especially for NPs used in applications with cultured cells. Additionally, our RFC model achieved an AUROC of 0.99 and F1 score of 0.93, whereas the RFC run on the yeast dataset achieved an AUROC of 0.83 and F1 score of 0.81. We also included core composition as an NP feature, a feature suggested in their work to be important for future consideration. While our models were constructed from data obtained from 18 protein–NP combinations, previous work has used a dataset with >600 NPs.42 This work focused on NP and experimental features, lacking protein features. The inclusion of protein features in our dataset led to a total feature count of 84. In comparison, this previous work had a feature count of only 21. In comparison to this previous work, our RFR model enables a quantitative prediction of specific protein abundance in the corona. In addition, our leave-one-group-out cross-validation tests the predictive ability of the RFR model on NPs that were not present in the training data, a key goal in the use of corona prediction for the design of novel nanomaterials.
One limitation of this current predictive ability is that a totally novel NP with features well-outside of our existing data set of NPs (mNP, gold-mNP, polystyrene) and ligands (citrate, polyethyleneimine, polyvinylpyrrolidone, polyethylene glycol, carboxylate) may need additional experimental data for predictions with high accuracy.
Previous work has shown that the protein corona determines the cellular and physiological response to NPs.1–6,10,19–37 The ability to predict the protein corona could provide a first step in predicting the cellular response. For example, previous work using 105 different gold NPs showed that the protein corona determined the interaction of NPs with cells, but this was determined experimentally in a tour de force experimental study.31 We hope the RFR and RFC models described above will provide a computational tool for pre-screening NP candidates for use in biological applications prior to experiments. For example, in the longer term, we could envision the use of a ML-predicted protein corona to reduce the need for cell and animal experiments to determine NP toxicity.
Supplementary information including additional data, as referenced in the text, and a comparison of the RFR and RFC models is available. See DOI: https://doi.org/10.1039/d5na00425j.
Footnotes |
† Current address: Donaldson Company, Bloomington, Minnesota, USA 55431. |
‡ Current address: Duke University School of Medicine, Durham, North Carolina, USA 27710. |
This journal is © The Royal Society of Chemistry 2025 |