Viacheslav
Muratov
a,
Karolina
Jagiello
*ab and
Tomasz
Puzyn
*ab
aUniversity of Gdansk, Faculty of Chemistry, Laboratory of Environmental Chemoinformatics, Wita Stwosza 63, 80-308 Gdansk, Poland. E-mail: karolina.jagiello@ug.edu.pl; tomasz.puzyn@ug.edu.pl
bQSAR Lab Ltd., Trzy lipy 3, 80-172 Gdansk, Poland
First published on 28th August 2025
The primary aim of our study was to address the problem of transcriptomic data complexity by introducing a novel transcriptomic response index (TRI), compressing the entire transcriptomic space into a single variable, and linking it with the inhaled multiwalled carbon nanotubes (MWCNTs) properties. This methodology allows us to predict fold change values of thousands of differentially expressed genes (DEGs) using a single variable and a single quantitative structure–activity relationship (QSAR) model. In the context of this work, TRI compressed 5167 DEGs into a single variable, explaining 99.9% of the entire transcriptomic space. Further TRI was linked to the properties of inhaled MWCNTs using a nano-QSAR model with statistics R2 = 0.83, QCV2 = 0.8, and Q2 = 0.78, which show a high level of goodness-of-fit, robustness, and predictability of the obtained model. By training a nano-QSAR model on fold changes of thousands of DEGs using a single variable, our study significantly contributes not only to new approach methodologies (NAMs) focused on reducing animal testing but also decreases the amount of computational resources needed for work with complex transcriptomic data. Developed during this work, the software called ChemBioML Platform (https://chembioml.com) offers researchers a powerful free-to-use tool for training regulatory acceptable machine learning (ML) models without a strong background in programming. The ChemBioML Platform integrates the ML capabilities of Python with the advanced graphical interface of unreal engine 5, creating a bridge between scientific research and the game development industry.
New conceptsWe introduce the transcriptomic response index (TRI) – a novel, biologically meaningful metric that compresses complex, high-dimensional gene expression data into a single predictive variable. This approach represents a conceptual breakthrough in the field of nanotoxicogenomics, enabling a direct, quantitative link between the structural properties of nanomaterials and global transcriptomic effects in exposed organisms. In contrast to existing models that either rely on isolated biomarkers or attempt to predict thousands of individual gene responses separately, TRI captures the entire transcriptomic variance (99.9%) in one interpretable parameter. This allows us to construct a robust, regulatorily compliant nano-QSAR model based on standard molecular descriptors (SiRMS), making transcriptomics scalable, reproducible, and predictive. This work uniquely bridges systems biology, chemoinformatics, and machine learning, offering a new paradigm for evaluating nanomaterial-induced biological perturbations. It further advances the principles of new approach methodologies (NAMs) by enabling mechanism-based hazard prediction without the need for animal testing or deep computational resources. The TRI concept opens new directions in nanoinformatics, particularly for safe-by-design strategies, high-throughput screening, and regulatory decision-making in nanotechnology. |
Although the adverse effects of MWCNTs are well-documented, the molecular mechanisms driving these effects remain largely unknown. In recent studies, researchers have started applying toxicogenomic approaches to close this gap and investigate the pulmonary toxicity of MWCNTs.4–8 It was shown that MWCNTs can induce gene expression changes associated with lung fibrosis and inflammation, comparable to those caused by known fibrogenic agents like bleomycin and bacterial infections.4 Recent research highlights the potential of gene expression profiling in predicting the pulmonary toxicity of new nanomaterials,4,5 offering a promising methodology for future research and risk assessment. Building on this genomic insight, advanced computational methods have emerged as powerful tools for correlating material properties with biological outcomes. For instance, Jagiello et al. established a methodology for constructing quantitative structure–activity relationship (QSAR) models based on canonical pathways. The authors proposed a grouping strategy for MWCNTs based on the aspect ratio affecting genes associated with specific pathways.6 Following this, Merugu et al. developed a QSAR model to predict canonical pathway induced by MWCNTs and triggering lung fibrosis and atherosclerosis, both initiated through acute immune response.7 Fortino et al. highlight the importance of these findings because MWCNTs cause pulmonary damage in the body after intratracheal instillation.8
In parallel, perturbation-based machine learning (PTML) modeling has gained traction as a complementary strategy for toxicity prediction. PTML offers a high degree of interpretability and supports the simultaneous modeling of multiple toxicological endpoints across diverse biological systems, including cells, microbes, and animal models. This flexibility is especially advantageous in nanotoxicology, where experimental conditions often vary and biological responses are heterogeneous.9–12 By integrating diverse datasets, PTML enables multi-endpoint predictions that improve both the efficiency of hazard assessments and the prioritization of materials for further testing. Moreover, its potential to support regulatory decision-making through transparent and mechanistically informed predictions positions PTML as a valuable asset in the next generation of nanotoxicological research.
The objective of our study is to investigate the potential of combining supervised learning methods, such as ridge regression, with unsupervised methods, like principal component analysis (PCA), to generate comprehensive and accurate predictions for the thousands of differentially expressed genes (DEGs) affected by MWCNT exposure in mouse lungs. In contrast to previous approaches, such as the work by Jagiello et al. and Merugu et al., which linked MWCNT properties with the benchmark dose lower confidence limit (BMDL) of a single transcriptomic pathway,6,7 here, we introduce a novel transcriptomic response index (TRI) derived from the fold change (FC) values of thousands of DEGs. By compressing the high-dimensional transcriptomic space into a single, informative variable, TRI measures all gene perturbations induced by MWCNT exposure. Biologically, a higher TRI value reflects a more pronounced deviation from baseline gene expression, which may correlate with the activation of key toxicological processes such as inflammation, fibrosis, and oxidative stress. We introduce a two-tiered analytical strategy to calculate TRI and link it to the exposure conditions. First, unsupervised PCA compresses the complex transcriptomic data into a comprehensive index (TRI) that describes overall gene expression changes. Subsequently, a supervised ridge regression model is used to correlate the structural properties of the MWCNTs, quantified using simplex representation of molecular structure (SiRMS) descriptors, with the TRI. Although SiRMS descriptors are traditionally applied to small molecules,13–17 they have also been successfully adapted to predict metal oxide toxicity.18 Our study is the first to leverage the full version of SiRMS descriptors to predict the transcriptomic response to MWCNT exposure in mouse lungs.
To ensure the reliability and robustness of our predictive models, we developed the ChemBioML Platform – an integrated machine learning software that features robust data preprocessing, a self-optimizing algorithm for selecting optimal descriptors and hyperparameters, and OECD-compatible validation techniques.19 This platform is designed to generate accurate, reproducible results that satisfy regulatory and academic standards for QSAR and other ML model evaluations. Thus, our work advances scientific knowledge on the molecular mechanisms underlying MWCNT-induced lung toxicity and offers a practical tool for the academic community. This manuscript and the accompanying GitHub repository (https://chembioml.com, https://github.com/chembiodev/ChemBioML-Platform) provide a robust platform for training ML models that can elucidate the complex molecular interactions following MWCNT exposure.
| # | MWCNT name | Concentration (μg per mouse) | Length (nm) | Diameter (nm) | Aspect ratio | Functionalization |
|---|---|---|---|---|---|---|
| NP_1 | NM-401 | 18 | 4050 (±2400) | 67 (±26.2) | 61 | Pristine |
| NP_2 | NM-401 | 54 | 4050 (±2400) | 67 (±26.2) | 61 | Pristine |
| NP_3 | NM-401 | 162 | 4050 (±2400) | 67 (±26.2) | 61 | Pristine |
| NP_4 | NRCWE-26 | 18 | 850 (±457) | 11 (±4.5) | 77 | Pristine |
| NP_5 | NRCWE-26 | 54 | 850 (±457) | 11 (±4.5) | 77 | Pristine |
| NP_6 | NRCWE-26 | 162 | 850 (±457) | 11 (±4.5) | 77 | Pristine |
| NP_7 | NRCWE-043 | 18 | 771.3 (±3471) | 26.73 (±6.88) | 29 | Pristine |
| NP_8 | NRCWE-043 | 54 | 771.3 (±3471) | 26.73 (±6.88) | 29 | Pristine |
| NP_9 | NRCWE-044 | 18 | 1330 (±2454) | 32.55 (±14.44) | 41 | –OH |
| NP_10 | NRCWE-044 | 54 | 1330 (±2454) | 32.55 (±14.44) | 41 | –OH |
| NP_11 | NRCWE-045 | 18 | 1553 (±2954) | 28.07 (±13.85) | 55 | –COOH |
| NP_12 | NRCWE-045 | 54 | 1553 (±2954) | 28.07 (±13.85) | 55 | –COOH |
| NP_13 | NRCWE-046 | 18 | 717.2 (±1214) | 17.22 (±5.77) | 42 | Pristine |
| NP_14 | NRCWE-046 | 54 | 717.2 (±1214) | 17.22 (±5.77) | 42 | Pristine |
| NP_15 | NRCWE-047 | 18 | 532.5 (±591.9) | 12.96 (±4.44) | 41 | –OH |
| NP_16 | NRCWE-047 | 54 | 532.5 (±591.9) | 12.96 (±4.44) | 41 | –OH |
| NP_17 | NRCWE-048 | 18 | 1604 (±5609) | 15.08 (±4.69) | 106 | –COOH |
| NP_18 | NRCWE-048 | 54 | 1604 (±5609) | 15.08 (±4.69) | 106 | –COOH |
| NP_19 | NRCWE-049 | 18 | 731.1 (±1473) | 13.85 (±6.09) | 53 | –NH2 |
| NP_20 | NRCWE-049 | 54 | 731.1 (±1473) | 13.85 (±6.09) | 53 | –NH2 |
To create a dataset for our study, we have filtered differentially expressed genes (DEGs) from two studies20,21 with the criteria of p-value ≤ 0.05 and an absolute FC ≥ 1.5.21,22,27–30 DEGs that changed significantly in response to at least one MWCNT were filtered. Although a stricter threshold, such as |FC| ≥ 2, has been used in other studies,31,32 we opted for |FC| ≥ 1.5 to remain consistent with the analysis approach used in the original datasets.20,21 This ensured methodological alignment with the initial design of the transcriptomic data. This process resulted in a curated dataset containing 5167 DEGs with post-exposure expression data. For each of the 9 MWCNTs, gene expression was measured under 2 or 3 dosing conditions – 18 μg per mouse and 54 μg per mouse for all MWCNTs, with NM-401 and NRCWE-26 additionally tested at 162 μg per mouse – yielding a total of 20 distinct experimental conditions (see Table S1).
| TRI = PC1 × Variance1 + PC2 × Variance2 + ⋯ + PCn × Variancen | (1) |
For our calculations, we applied the open-source SiRMS Python package, available on GitHub [https://github.com/DrrDom/sirms]. This package allows weighting descriptors by various physicochemical properties, including electronic charge, lipophilicity (LogP), refractivity, and hydrogen bonding.36–38 A concentration-dependent weighting of the simplex descriptors was applied to account for the varying concentrations of MWCNTs in the experiments (18 μg, 54 μg, and 162 μg per mouse). Specifically, we used the 18 μg dose as a baseline for concentration-dependent weighting. Since 54 μg is three times 18 μg and 162 μg is nine times 18 μg, the corresponding descriptors were multiplied by factors of 3 and 9, respectively. This concentration-weighting approach integrates the concentration of inhaled MWCNTs into the molecular descriptors, ensuring that they reflect the dose-dependent variations in the organism's reaction.
Ridge regressor was used with the genetic algorithm (GA) feature selection procedure to train the models predicting TRI. Ridge is a linear regression method that applies L2 regularization during the training of the model, allowing us to use cross-correlated descriptors39 and to use a highly dimensional space of independent variables40 for modeling. Furthermore, the GA is a feature selection method that allows us to identify the best set of descriptors for training the nano-QSAR model.41
A comprehensive validation was performed to address possible overfitting, which can occur when training the model on such a small dataset. The goodness-of-fit of the models was evaluated using the coefficient of determination (R2) and the root mean square error of calibration (RMSEC). The robustness of the model was assessed through leave-one-out cross-validation (CV) using the coefficient of determination for CV (QCV2) and root mean square error for CV (RMSECV). Predictive performance was validated through the external validation coefficient and the root mean square error of external validation (RMSEEXT), together with the coefficient of determination for the external predictions (Q2). The applicability domain of the models, defined by the range of responses and structural feature descriptors, was analyzed using the leverage approach and visualized with a Williams plot. This method is widely used in the evaluation of AD in QSAR modeling with well-defined, previously described algorithms.42,43
The dimensionality reduction achieved by consolidating 5167 gene-specific endpoints into a single TRI variable substantially streamlines the modeling process. Rather than training thousands of individual nano-QSAR models, one for each DEG, the TRI approach enables the training of a single, integrated model. Beyond computational efficiency, the TRI approach enhances interpretability and generalizability. By summarizing the global transcriptomic response into a compact form, the TRI is better suited for predictive modeling and can be readily integrated with other omics data or phenotypic outcomes. This systems-level representation is especially advantageous in contexts where understanding coordinated gene activity is more informative than isolated gene-level effects.
To compute TRI, first, the PCA was applied to the dataset of 5167 DEGs. The calculated values of all principal components and the corresponding TRI scores are provided in Table S2. The first 19 PCs (PC1 through PC19) were selected for TRI calculation, as they collectively explain ∼99.9% of the total variance in the over 5000 gene expression matrix. This indicates that they capture nearly the entire structure of the dataset, making them a robust foundation for dimensionality reduction and feature integration.
In contrast, retaining only the first two PCs captures just ∼43% of the total variance, which would lead to substantial information loss and potentially weaken model performance. Nevertheless, PC1 and PC2 were utilized to construct a simplified two-dimensional “transcriptomic space” to visualize and explore similarities between different MWCNTs based on DEG fold-change patterns (Fig. 1). This 2D representation provides an intuitive overview of the underlying transcriptomic relationships, while the full set of 19 PCs ensures high-fidelity data representation for TRI construction.
Examination of Fig. 1 reveals that most inhaled MWCNTs cluster in the negative regions of both PCs, indicating a high degree of similarity. However, differences are evident; PC2 predominantly reflects the variance associated with DEGs influenced by pristine MWCNTs. In contrast, PC1 exhibits a cluster comprising multiple MWCNT variants with low aspect ratios, including those with –OH and –COOH functionalization and pristine forms. This distribution indicates that using either PC1 or PC2 alone as the endpoint for the nano-QSAR model is insufficient, as neither fully captures the transcriptomic response of all studied MWCNTs. To overcome this limitation, we computed the TRI, which integrates 19 principal components weighted by their respective explained variance. This approach ensures that TRI captures the multidimensional structure of the data, providing a more complete and biologically meaningful representation of global transcriptomic shifts across all studied MWCNTs.
TRI is a highly interpretable and reversible PCA-based index. As a linear combination of gene expression values, TRI enables the prediction of global transcriptomic shifts while maintaining the ability to trace individual gene contributions. Specifically, genes with positive loadings (Table S3) will be upregulated as TRI increases, whereas those with negative loadings will be downregulated. This interpretability makes TRI not only transparent but also biologically informative.
In addition to offering high interpretability through its linear structure and transparent formulation, the TRI is also straightforward to validate. Although unsupervised methods based on principal component analysis (PCA) inherently lack explicit quantitative performance metrics such as R2 or RMSE, TRI benefits from robust qualitative validation. In our study, this included careful preprocessing of the transcriptomic dataset by filtering genes based on stringent significance thresholds (p-value ≤ 0.05 and |fold change| ≥ 1.5), ensuring the absence of outliers via inspection of the PC1–PC2 projection (Fig. 1), and confirming that TRI effectively captures the transcriptomic variance observed across all relevant differentially expressed genes (Table S1). These steps collectively reinforce confidence that TRI accurately represents the underlying biological signal.
Nevertheless, we recognize that the ideal number and configuration of principal components may differ depending on biological context, dataset heterogeneity, and modeling goals. The fixed use of 19 components in this study was tailored to balance comprehensiveness with interpretability. However, for broader applications, such as large-scale, multi-condition experiments or cross-species comparisons, further refinement of the TRI framework may enhance its performance and generalizability. Potential improvements could include exploring alternative dimensionality reduction approaches (e.g., independent component analysis, factor analysis), integrating non-linear manifold learning techniques such as t-SNE or UMAP, or employing data-driven component selection based on cross-validated model performance. Such refinements would not only optimize model efficiency but also strengthen TRI's utility as a generalizable tool for high-dimensional transcriptomic analysis.
Besides recognized limitations, we introduced the TRI approach as a novel framework for integrating chemical structure descriptors with transcriptomic data. It is important to contextualize how this approach compares with existing methods in the field. While traditional QSAR models can be effective for specific endpoints (Lethal Dose 50, e.g.), they often struggle to capture complex biological effects. They can suffer from well-known QSAR limitations (e.g., limited applicability domain and the “activity cliff” problem).45 On the other end of the spectrum, recent advanced methods leverage high-dimensional transcriptomic data directly. For example, deep learning frameworks have been developed to predict entire gene expression profiles from chemical structures.46,47 Such methods (e.g., CIGER and DeepCOP) use graph neural networks or other architectures to forecast genome-wide expression changes and have shown impressive performance on large datasets.46,47 Our TRI approach offers a complementary strategy: instead of predicting thousands of individual gene responses, we condense the transcriptomic effect into an interpretable index and model that index using molecular descriptors for the inhaled MWCNTs.
SiRMS descriptors were calculated for the studied MWCNTs. This method is very commonly used for small molecules due to its ability to balance between the accuracy of predictions and the very comprehensive physicochemical interpretation of the obtained QSAR models.13–17 The SiRMS methodology partitions the molecular structure into simplexes of 1 to 6 atoms, effectively capturing the molecule's fundamental structural elements. These simplexes represent essential features such as functional groups (e.g., hydroxyl, carboxyl groups) and bonding environments (e.g., single, double, and triple bonds). By decomposing the molecule in this manner, SiRMS enables identifying and quantifying specific chemical environments critical to the material's physico-chemical behavior. This detailed representation enhances the model's ability to predict molecular interactions with biological systems, thereby providing valuable insights into the MoA of pulmonary toxicity induced by inhaled MWCNTs.
Previously, SiRMS descriptors were calculated for metal-based engineered nanomaterials (ENMs) in a simplified form, utilizing a one-dimensional method that provided only a general overview of the structures.18 The simplification in this case was necessitated by the uncertainty surrounding the crystal structure of the ENMs used.18 In contrast, for MWCNTs, we do not face such structural limitations. This has allowed us to compute two-dimensional SiRMS descriptors, which offer significantly more profound insights into the studied structures.
Moreover, we leveraged the ability of the SiRMS methodology to weight descriptors by specific properties, including charge, LogP, refractivity, and hydrogen bonding, as detailed in the study published by the script's author.36–38 Additionally, to account for the varying concentrations of MWCNTs used in our study, we implemented a concentration-dependent weighting of the simplex descriptors. Precisely, the number of simplexes represented in each descriptor was scaled proportionally to the amount of inhaled MWCNTs: for a concentration of 18 μg per mouse, the initial set of simplexes was used; for 54 μg per mouse, the descriptors were multiplied by a factor of three; and for 162 μg per mouse, they were scaled by a factor of nine. This weighting approach assumed that it reflects the proportional increase in exposure and thus enables the model to better account for dose-dependent biological responses. As a result, the complete set of approximately 3500 SiRMS descriptors calculated is provided in Table S4. These descriptors were subsequently utilized for training and interpreting the nano-QSAR model, which predicts changes in thousands of DEGs compressed in the TRI.
In this study, descriptor values for MWCNT exposures were linearly scaled based on concentration levels (18, 54, and 162 μg per mouse), under the assumption of a linear dose–response relationship. This approach was intended to approximate proportional biological effects corresponding to increasing concentrations. However, biological responses to nanomaterials can be nonlinear or exhibit threshold behaviors, and therefore, this linear scaling assumption requires future experimental validation. Further research should investigate nonlinear modeling approaches to assess the validity and limitations of this assumption.
To prove the obtained model's proper predictability, we randomly selected four data points and moved them to an external validation set. These data points represent the whole set and are not used in model training or any other model optimization procedure. The rest of the 16 data points were used for training and internal validation of the model. Predictions of the training compounds were evaluated with R2 and RMSEc metrics, and the model showed a good goodness-of-fit. The robustness of the model was evaluated with the same metrics for CV and showed promising results. The Q2 metric, an analog of R2 for the training set, and RMSEEXT were calculated for the predictability evaluation. These statistics can be seen in Table 2. Overall assessment of them shows a good level of goodness-of-fit, robustness, and external predictability. It is important to note that not only does the absolute value of these metrics matter, but the difference between R2 and Q2 also mustn’t exceed 0.3.49,52 In the case of our nano-QSAR model, the difference is less than 0.1, which indicates that the model is not overfitted. A plot showing the distribution of predicted and observed values is shown in Fig. 2a. Although the model evaluation employed both external validation and CV, we acknowledge that the relatively small sample size (20 data points) may limit the generalizability of the findings. Further validation using larger, independent datasets is necessary to fully establish the model's predictive performance and mitigate overfitting risks.
| Statistic | Value |
|---|---|
| R 2 | 0.83 |
| RMSEC | 0.15 |
| Q CV 2 | 0.8 |
| RMSECV | 0.16 |
| Q 2 | 0.78 |
![]() | ||
| Fig. 2 (a) Observed-predicted plot presenting predictions’ accuracy in two-dimensional space; (b) Williams plot presenting the model AD. | ||
Completing the pre-processing steps, we have applied a combination of GridSearch and genetic algorithm (GA) to search for the best ridge hyperparameters and descriptors to train the nano-QSAR model. As a result, we got the ridge regression model eqn (2), a linear regression equation linking the TRI to MWCNT structural properties with good goodness-of-fit, robustness, and predictability.
| TRI = 0.53 × CHARGEA–A–B–B–A–A − 0.57 × CHARGEB–B–B–C + 0.37 × REFRACTIVITYB.C − 0.46 × CHARGEA.A | (2) |
An essential criterion in QSAR model evaluation is the applicability domain (AD).19 This method allows us to identify how similar the structures from validation or prediction are similar to those used for the model training. Also, it helps to identify the outliers from training or validation sets that are structurally similar but are predicted inaccurately. To define the AD for our nano-QSAR model, we constructed a Williams plot that displays both the structural differences of the MWCNTs via leverages and the accuracy of predictions via standardized residuals (Fig. 2b).
In Fig. 2b, we can see that all predictions for training and validation sets have absolute standardized residuals less than ±3. This indicates that there are no significant outliers – points for which the model's predictions are incorrect, even if the structure is similar to others. One data point, NP_3, exhibits a leverage of h = 0.85 (Table S5), indicating that it is structurally distinct from the other MWCNTs in the training set. NP_3 corresponds to the largest MWCNT (NM-401) tested at 162 μg per mouse concentration. While this is not a significant issue, provided that the prediction is accurate, further investigation of other MWCNTs at similar concentrations could be valuable for advancing both experimental and computational nanotoxicity studies. Such an extension of the existing data will provide essential data for filling the gap in the applicability of models predicting the transcriptomic response to MWCNTs inhalation.
Thus, the developed software ChemBioML Platform successfully integrated essential capabilities required for training the modern QSAR model. It includes comprehensive data preprocessing, a GA-based feature selection algorithm, a hyperparameter optimizer, and a ridge regressor for training robust, well-predictive QSAR models. Applying L2 regularization addresses common limitations like predictor multivariance and high predictor space.39,40 Furthermore, it evaluates the robustness and predictivity of the trained models by applying CV, external validation, and AD. Additionally, the ChemBioML Platform offers a graphical representation of the obtained models, as shown in Fig. 2, and demonstrates a significant optimization in combination with overall TRI methodology. A single model training, including feature selection and validation, took just 2
:
40 minutes with the use of an AMD Ryzen 5 3500U CPU (which is a usual processor equipped in office laptops). For example, if we were training models for each gene separately, it could take ∼230 hours with the same computer, which showcases a strong decrease in the amount of computational resources needed for work with complex transcriptomic data.
In the nano-QSAR equation (eqn (2)), it is evident that descriptors weighted by charge exert the most significant impact on the model's predictions. This might be expected, as the charge is a well-known MWCNT parameter, affecting their toxicity.55,56 Zeinabad et al. investigated the impact of the charge on the interaction between CNTs and biological systems such as cells and proteins, using the example of Tau protein. They hypothesized that the interaction mechanism depends on the charge distribution and protein surfaces.57 Our study extends this understanding by demonstrating that the charge of MWCNTs may significantly influence biological outcomes following their inhalation. Furthermore, while “CHARGEA–A–B–B–A–A” has a negative coefficient, the coefficients for other charge-weighted descriptors differ, suggesting that distinct charged regions contribute oppositely to the transcriptomic response. It is well known that the charge distribution on MWCNTs is uneven, higher in the tube ends.58 Our model indicates that regions with differing charges can exert opposing effects on the transcriptomic response. Notably, the descriptors “CHARGEA–A–B–B–A–A” and “CHARGEA.A” predominantly characterize pristine MWCNTs, whereas “CHARGEB–B–B–C” and “REFRACTIVITYB.C” are more common in functionalized variants; no descriptor was found to be exclusive to any specific functional group (Table S5). This suggests that the possible impact of the functional groups, such as –OH, –COOH, and –NH2 can be indirect, thereby affecting the charge and refractivity of the MWCNTs, which are crucial for the transcriptomic response. Importantly, in this study, we have used functionalized but uncoated MWCNTs. In the case of preparing nano-QSAR models for coated ENMs, an additional set of descriptors can be calculated by the Dragon software43 or using the SiRMS method. This approach would ensure that possible parameters of the additional molecules are counted during the modelling.
Thus, our results highlight that charge-weighted descriptors (i.e., descriptors reflecting the distribution of partial charges in a structure) play a significant role in predicting transcriptomic responses. This finding suggests a mechanistic link between MWCNTs’ electronic structure and the genes they perturb. One plausible explanation is that charge distribution influences how a molecule interacts with cellular components. For instance, highly electrophilic MWCNTs are known to activate cellular stress pathways. Literature examples support this: reactive electrophiles can trigger robust gene expression responses such as heat-shock protein induction59 and the activation of detoxification genes via the Nrf2-mediated electrophile response element.60 Our observation that charge-based descriptors are predictive aligns with these examples – molecules or materials bearing charged or polar functional groups may more readily form covalent bonds or induce oxidative stress, leading to upregulation of defensive transcripts. Conversely, compounds with certain charge features might have reduced cell permeability or distinct subcellular localization, affecting which genes are up- or down-regulated. For example, strong polar or ionic compounds may accumulate in specific cellular compartments or interact with membrane proteins, eliciting unique transcriptomic signatures. This notion is consistent with general toxicological principles: baseline cytotoxicity often arises from hydrophobic compounds integrating into membranes,61 whereas specific gene responses often correlate with reactive functionalities. In line with this, our model's emphasis on charge-weighted descriptors echoes findings from other QSAR studies that identified surface charge properties as key determinants of bioactivity.62
In addition to the TRI combined from PC1–19, we have tested the predictability of the index calculated with PC1–5, PC1–10, and PC1–15. The graph shown in Fig. 3 represents that R2 and QCV2 significantly increased for the TRI combined from PC1–19. This showcases that in our study, the full TRI not only explains 99.9% of the transcriptomic space consisting of 5167 genes, but also gives the best opportunities for training nano-QSAR models predicting the transcriptomic responses to MWCNTs exposure in lungs.
Although the developed model provides a powerful framework for exploring the relationship between MWCNTs' characteristics and biological responses, it has limitations. One primary limitation of this study is the small sample size of compounds, which constrains the diversity of chemical space explored. A limited number of MWCNTs increases the risk of overfitting and reduces confidence that the observed structure–response relationships will hold broadly. Indeed, prior analyses have shown that using too few MWCNTs can yield unstable gene expression signatures and poor predictive performance.63 Consequently, the AD of our model is narrow; the predictions may not generalize well beyond the MWCNTs represented in the training set.62 In our study, we provide and analyze the AD using the Williams plot from Fig. 2b. For all new predictions, AD also has to be calculated based on the studied MWCNTs structures, and only the predictions inside AD can be considered as accurate and used for further analysis. This restricted domain makes the model highly specialized and sensitive to outliers and underscores the need for a cautious interpretation of the results. Future experiments should expand the number and chemical diversity dataset to address this limitation. For example, including a broader range of MWCNTs would help ensure that the learned relationships are not idiosyncratic to a small subset. Likewise, collecting more samples per nanotube (e.g., across different concentrations) could improve the robustness of transcriptomic signals and reduce experimental variability. By explicitly acknowledging the sample size issue, we emphasize that further validation on larger, independent datasets is essential before the model's predictions can be generalizable. This limitation is not unique to our work – it mirrors challenges in many toxicogenomic modeling studies where a paucity of data can undermine external predictivity.62
| This journal is © The Royal Society of Chemistry 2025 |