Open Access Article
V. Gkatsis (a,b), P. Maratos (c), C. Rekatsinas* (b), G. Giannakopoulos (b,d) and P. Krokidas* (b)
(a) Department of Informatics and Telecommunications, National and Kapodistrian University, Athens, Greece
(b) Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Agia Paraskevi 15310, Greece. E-mail: crek@iit.demokritos.gr; p.krokidas@iit.demokritos.gr
(c) School of Electrical & Computer Engineering, National Technical University of Athens, Athens, Greece
(d) SciFY PNPC, Agia Paraskevi 15310, Greece
First published on 29th September 2025
Machine learning algorithms often rely on large training datasets to achieve high performance. However, in domains like chemistry and materials science, acquiring such data is an expensive and laborious process, involving highly trained human experts and material costs. Therefore, it is crucial to develop strategies that minimize the size of training sets while preserving predictive accuracy. The objective is to select an optimal subset of data points from a larger pool of possible samples, one that is sufficiently informative to train an effective machine learning model. Active learning (AL) methods, which iteratively annotate data points by querying an oracle (e.g., a scientist conducting experiments), have proven highly effective for such tasks. However, challenges remain, particularly for regression tasks, which are generally considered more complex in the AL framework. This complexity stems from the need for uncertainty estimation and the continuous nature of the output space. In this work, we introduce density-aware greedy sampling (DAGS), an active learning method for regression that integrates uncertainty estimation with data density, specifically designed for large design spaces (DS). We evaluate DAGS in both synthetic data and multiple real-world datasets of functionalized nanoporous materials, such as metal–organic frameworks (MOFs) and covalent-organic frameworks (COFs), for separation applications. Our results demonstrate that DAGS consistently outperforms both random sampling and state-of-the-art AL techniques in training regression models effectively with a limited number of data points, even in datasets with a high number of features.
The emergence of artificial intelligence (AI) and machine learning (ML) offers new opportunities in this area. ML models excel at finding patterns in data, often surpassing human capabilities.3–6 These models can take information about a material's characteristics (features) and predict how it will perform (properties). By doing so, ML can guide researchers in selecting which materials to study, reducing experimental effort and cost.7 However, reliable ML models require high-quality data for training, and generating such data is expensive and labor-intensive. This creates a paradox: while we aim to reduce experimental costs, creating the large, diverse datasets needed for training these models remains costly. To address the limitations of conventional data collection methods, researchers are exploring strategies that go beyond random sampling (RS)—the simplest approach for selecting candidate instances to build training datasets in materials science. In RS, samples are chosen randomly and independently from a larger pool of possible materials. This pool, once mapped to a vector space where each dimension represents a structural or compositional property, is commonly referred to as the design space of the materials. For each selected sample from this design space, an experiment—computational or experimental—is conducted to determine the values of the target (dependent) variables. The resulting instance, consisting of input features and corresponding outputs, is then added to the dataset. While RS is straightforward and easy to implement, it does not incorporate any prior knowledge about the distribution or structure of the data. As a result, it may frequently select samples that are redundant or unlikely to improve the model's performance—commonly known as uninformative samples—particularly in low-data regimes where every labeled point carries significant weight.
In such cases, active learning (AL) techniques are more appropriate, as they aim to strategically select the most informative samples, thereby maximizing model improvement while minimizing the number of required labeled instances.8,9 Active learning is a semi-supervised learning method: the target values of the dataset are initially only partially known, and the machine learning model is trained by selecting data points one by one and querying an oracle for their target values. The method uses the knowledge acquired so far about the data space to guide the selection of the next data point, usually by evaluating an uncertainty measure. Once the oracle has annotated the selected sample, the resulting feature-value pair is added to the training set, updating the model's knowledge of the data space. This allows researchers to focus on the most informative samples and thus make the most of a limited annotation budget. AL techniques often use concepts like diversity10 (choosing samples that differ significantly from each other) and representativeness11 (choosing samples that best represent the dataset) to guide sample selection.
Some active learning methods, known as model-based approaches, rely on the ML model to guide the identification of the samples to annotate, focusing on those most likely to improve predictions. A seminal example is the work by Cohn et al.,12 who proposed selecting samples to reduce model uncertainty. However, this approach can be computationally intensive, particularly for neural networks, and relies on assumptions (e.g., Gaussian noise, negligible bias) that may not always hold, limiting its scalability and generality. AL has been extensively used for classification tasks, where the selection criterion often relies on entropy-based uncertainty measures,13,14 vote entropy15 and expected model change.16 However, in regression tasks, where the computation of entropy is infeasible, the AL literature is limited, and the main techniques require different criteria to substitute for the uncertainty measure. For regression tasks, approaches such as greedy sampling (GS) combine diversity and representativeness to improve predictions by greedily selecting the data sample that maximizes a specified criterion: GSx focuses exclusively on exploring the feature space, while GSy prioritizes exploration of the target property space through the model's predictions.17 Although both methods learn the design space adequately, their individual predictive performance is hindered by their lack of insight into the other's domain (the target property space for GSx and the feature space for GSy). To this end, the improved GS method (iGS) was devised to combine both methods, achieving remarkable results.18,19 Another prominent technique, expected model change maximization (EMCM),20 evaluates the potential impact of annotating a sample on the current model and selects the sample that leads to the greatest change in the model's parameters, measured as the difference between the current model parameters and the updated parameters after training with the enlarged training set. This method works under the assumption that the greatest parameter change is correlated with significant learning opportunities in the design space. While effective, methods like EMCM can be computationally intensive, as the model has to repeatedly estimate the gradient of the loss and update all model parameters for each newly annotated sample. This led to the development of batch strategies such as B-EMCM21 to address these challenges.
Recently, researchers have explored Mondrian trees,22,23 a type of regression tree that branches randomly rather than based on features. While they can achieve modest improvements over other state-of-the-art methods, the high variance in predicted values within each leaf node and the reliance on scaling datasets to a fixed range ([0, 1]) can limit their practicality.23 Furthermore, emerging AL techniques now incorporate advanced tools like Bayesian models, Gaussian processes (GP), and deep learning. For example, Gaussian processes can model data uncertainty but are mainly used for low-dimensional datasets. Similarly, deep learning-based methods, such as batch mode deep active learning (BMDAL),24 are designed for large datasets and may not suit applications where data annotation is expensive.
In this work, we address a critical limitation of AL: in some problem cases it struggles to significantly outperform baseline sampling methods at finding the most informative data points and efficiently training ML models with them. This happens when the data space is not homogeneous, meaning that the data samples are not uniformly distributed across the feature space hypercube but instead form dense and sparse regions, which degrades the performance of purely exploratory AL frameworks. A purely exploratory AL framework such as iGS mainly selects samples from sparse regions, as they are more diverse with respect to the already explored space, whereas simpler methods such as RS follow the underlying density distribution and select more samples from the denser regions, thus optimizing the predictions of the model. To overcome this, we propose an AL framework for regression tasks that incorporates density-awareness into the selection process of improved greedy sampling, called density-aware greedy sampling (DAGS). For classification tasks, modeling the density of the data space is common, as the framework has to ensure that the selected sample is both informative and representative of its class.25–28 However, density-awareness has been largely overlooked in the active learning literature for regression tasks, and this constitutes the main contribution of our work. Our method explicitly exploits density as a characteristic of the design space, allowing us to balance exploration with representativeness and thereby select more informative samples. In this way, we address the limitations of iGS, which often overemphasizes outliers and expends oracle queries on points that contribute little to model improvement. Our results further show that the proposed density-aware approach can match or even surpass random sampling, which implicitly reflects data density to some extent. Finally, we benchmark DAGS not only against random sampling but also against more sophisticated active learning strategies, including query-by-committee,29 regression tree-based AL,30 and plain iGS.17
To evaluate the performance of our proposed framework against the aforementioned techniques, we first constructed synthetic datasets based on four distinct formulas. Each formula is examined in two versions: (a) homogeneous and (b) non-homogeneous distributions of data points. Following this controlled evaluation, we apply the framework to a real-world scenario involving complex sample spaces of materials with high correlation complexities and heterogeneity. Specifically, we focus on metal–organic frameworks (MOFs),31 a class of functionalized materials whose structures can be modulated at the molecular level. MOFs exhibit exceptional potential as adsorbent/storage materials32 or components in separation membranes.33 However, understanding how their design influences performance remains a complex challenge, often requiring either labor-intensive experiments or computationally demanding in silico simulations. Our proposed framework aims to address these challenges by improving the efficiency and accuracy of predictive modeling in such complex material systems. In both the synthetic and the MOF datasets, our approach consistently outperforms the other methods.
To obtain the target value $y_i$ for a specific input $u_i$, the user must run one iteration of the expensive process $f$. However, when the locations of inputs yielding desirable outputs are unknown within the design space, the user may need to evaluate many such inputs, resulting in high overall cost.
To address this, we propose to train a machine learning model $M$ that approximates the mapping $f: U \to Y$. In this way the user can obtain a good estimate of the target value of each $u_i$ and can thus run targeted iterations of $f$ for those $u_i$ predicted to have a target value $y_i$ closer to the desired one. Training the model $M$ also means acquiring target values for each $u_i$ that will be used as the training dataset. In order to create an efficient model, we want to find the balance between maximizing the model performance and minimizing the dataset creation cost. A low cost means that $M$ should use the least amount of training data, so that the number of iterations of $f$ performed is reduced as much as possible without significant loss of estimation performance. For this reason, we assume a limited budget of $N$ available iterations. Let $L \subseteq U$ be the subset of inputs selected for training, such that:
| $L = \{l_1, l_2, \ldots, l_N\},\quad l_i \in U$ | (1) |
Initially, $L$ will contain $k < N$ randomly selected samples. For each randomly selected sample, an iteration of $f$ is performed and its target value is acquired, and the machine learning model is trained with $L$ as the training dataset. Then, we define a selection process, $s$, which, given the current state of the model and the training dataset, identifies the next element that should be used for training:
| $L_{i+1} = s(L_i, M)$ | (2) |
The elements are sampled one by one; for each one, an iteration of $f$ is performed, and after its target value has been acquired, the machine learning model is retrained with $L_{i+1}$ as the training dataset.
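The sampling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository: `expensive_f`, `NearestNeighbourModel` and `select_next` are hypothetical stand-ins for the costly process f, the model M and the selection process s, and the placeholder selection rule simply picks the candidate farthest from the already labelled inputs.

```python
import math
import random

def expensive_f(u):
    """Hypothetical stand-in for the costly process f (here a cheap formula)."""
    return math.sin(3.0 * u)

class NearestNeighbourModel:
    """Toy model M: predicts the target of the closest labelled input."""
    def fit(self, inputs, targets):
        self.inputs, self.targets = list(inputs), list(targets)
        return self

    def predict(self, u):
        j = min(range(len(self.inputs)), key=lambda i: abs(self.inputs[i] - u))
        return self.targets[j]

def select_next(labelled, pool, model):
    """Placeholder selection process s: the pool input farthest from the labelled set.
    A real criterion (such as DAGS) would also consult the model's predictions."""
    return max(pool, key=lambda u: min(abs(u - l) for l in labelled))

def active_learning(design_space, budget_n, k, seed=0):
    rng = random.Random(seed)
    labelled = rng.sample(design_space, k)          # L_0: k < N random inputs
    targets = [expensive_f(u) for u in labelled]    # one iteration of f per input
    pool = [u for u in design_space if u not in labelled]
    model = NearestNeighbourModel().fit(labelled, targets)
    while len(labelled) < budget_n and pool:
        u_next = select_next(labelled, pool, model)  # next element proposed by s
        pool.remove(u_next)
        labelled.append(u_next)
        targets.append(expensive_f(u_next))          # query the oracle
        model.fit(labelled, targets)                 # retrain on L_{i+1}
    return labelled, targets, model

design_space = [i / 99 for i in range(100)]
labelled, targets, model = active_learning(design_space, budget_n=20, k=5)
```

The loop stops once the budget of N labelled inputs is reached, so f is never evaluated more than N times.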
Our goal in this research work is the creation of an algorithm serving as a selection process s, which will efficiently achieve this balance between model performance and data creation cost.
Fig. 1 provides a graphical representation of the problem case that we are exploring.
![]() | (3) |
![]() | (4) |
![]() | (5) |
This strategy assumes that areas of disagreement represent regions with high uncertainty, making them valuable for improving the model's performance. While QBC reduces overfitting when the predictors are diverse, for example when they come from different learning paradigms, it suffers from limitations in regression tasks. Specifically, its focus on the target property alone for query selection often leads to suboptimal performance in cases with complex feature–target correlations, such as MOF datasets. In our implementation, the models used were XGBoost,38 random forest, and Gaussian process regressors; studies have shown that two or three predictors are generally sufficient.39 Despite its conceptual appeal, QBC's performance is often inferior to more balanced exploration–exploitation approaches like iGS and density-based methods, particularly in high-dimensional regression problems. It is also worth mentioning that the need for multiple regressors (those forming the committee) may make this approach more expensive than others, because all of them must be retrained each time a new data point is sampled. The query-by-committee code used in this work was developed by the authors.
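The idea of querying where the committee disagrees can be illustrated with a short sketch. It assumes that disagreement is measured as the variance of the committee's predictions, which is a common choice but not necessarily the exact criterion of the equations above; the three predictors are hypothetical, already-trained stand-ins rather than the XGBoost, random forest and Gaussian process regressors used in the study.

```python
from statistics import pvariance

def qbc_select(pool, committee):
    """Return the pool input on which the committee's predictions disagree most.
    Disagreement is taken to be the population variance of the predictions
    (an assumption made for this sketch)."""
    def disagreement(x):
        return pvariance([predict(x) for predict in committee])
    return max(pool, key=disagreement)

# Three hypothetical, already-trained predictors (plain functions for illustration)
committee = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.1,
    lambda x: x ** 2,
]
pool = [0.0, 0.5, 1.0, 3.0]
chosen = qbc_select(pool, committee)  # 3.0: the predictions 6.0, 6.1 and 9.0 spread the most
```

Note that the criterion looks only at predicted targets, which is the limitation discussed above: the position of a candidate in the feature space plays no role.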
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
![]() | (10) |
![]() | (11) |
Extensive benchmarking demonstrates the ability of RT-AL to achieve lower error rates with reduced sample sizes compared to other state-of-the-art methods, particularly in datasets with complex distributions. The method's robustness and low variance make it a reliable choice for regression tasks across diverse application domains. For our implementation of RT-AL, we have adapted and used the code provided by Jose et al.30
The aforementioned methods work well in scenarios where data are uniformly distributed across the design space. However, many real-world datasets exhibit imbalances, with dense and sparse regions in the design space. In such cases, purely explorational AL techniques may underperform, and even random sampling can outperform these methods.
![]() | (12) |
Using this factor, the next sample is selected by:
![]() | (13) |
This approach has been inspired by the work of Zhu et al.,8 where a similar strategy was proposed for classification tasks. Specifically, the authors devise an uncertainty-based active learning framework named sampling by uncertainty and density (SUD), where the selection criterion consists of the multiplication of an uncertainty and a density factor. In this method, the uncertainty for each unlabeled sample is modeled as the entropy of the estimated probabilities for the sample to belong in each class. The density factor is computed as the average cosine similarity of the sample x with its K-nearest neighbors. The two factors are then multiplied to produce the final selection metric for the unlabeled set. The main drawback of this approach when implemented for regression tasks is that calculating the entropy for each sample is challenging as there are no well defined classes (we can either model each sample as a separate class or rely on clustering methods which make the entropy computation inefficient and inaccurate).
To tackle this problem, in our method we substitute the entropy with the iGS factor, which adequately represents the uncertainty of a design space and is suitable for regression applications. Another difference between the two methods is that we define density as the average inverse of the sample's Euclidean distances to its neighbors rather than the average of their cosine similarities. This decision is based on the assumption that, after querying a sample, the model gains knowledge of the target property's behaviour in a small area around it, as samples with almost identical feature values will probably exhibit similar target property values. In line with this assumption, a dense area should consist of samples that are close in absolute terms, not necessarily samples whose feature vectors point in the same direction. In general, the Euclidean distance provides a more intuitive and reliable measure of the density “neighborhood”, particularly in continuous spaces where absolute distances are critical, as it directly measures spatial proximity. In conclusion, our density-based method selects samples that maximize exploration while prioritizing dense areas, ensuring the selected samples provide the most significant knowledge about the design space. This helps reduce the average prediction error, as sparse areas often represent outliers with little relevance.
![]() | ||
| Fig. 2 Given a design space, iGS computes uncertainties (v) and K-NN computes densities (d) for all samples. DAGS selects the sample that maximizes the product of v and d. | ||
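The selection rule summarized in Fig. 2 can be sketched in a few lines. Two assumptions are made here and should be checked against the repository: the iGS factor v is taken to be the smallest product of feature-space and target-space distances between a candidate and any labelled sample, and the density d is taken to be the mean of the inverse Euclidean distances to the K nearest neighbours. The names `igs_uncertainty`, `knn_density`, `dags_select` and the constant `EPS` are hypothetical.

```python
import math

EPS = 1e-12  # assumption: avoids division by zero when two points coincide

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def igs_uncertainty(i, points, predict, labelled_idx, labelled_y):
    """Assumed iGS factor v: smallest product of feature-space and
    target-space distances between candidate i and any labelled sample."""
    y_hat = predict(points[i])
    return min(
        euclidean(points[i], points[j]) * abs(y_hat - y_j)
        for j, y_j in zip(labelled_idx, labelled_y)
    )

def knn_density(i, points, k):
    """Density factor d: mean inverse Euclidean distance to the k nearest neighbours."""
    dists = sorted(euclidean(points[i], p) for j, p in enumerate(points) if j != i)
    nearest = dists[:k]
    return sum(1.0 / (dist + EPS) for dist in nearest) / len(nearest)

def dags_select(points, predict, labelled_idx, labelled_y, k=2):
    """Return the index of the unlabelled point that maximizes v * d."""
    labelled = set(labelled_idx)
    pool = [i for i in range(len(points)) if i not in labelled]
    return max(
        pool,
        key=lambda i: igs_uncertainty(i, points, predict, labelled_idx, labelled_y)
        * knn_density(i, points, k),
    )

# Four clustered points and one isolated point in a 2d feature space
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
predict = lambda x: x[0] + x[1]  # hypothetical stand-in for the trained regressor
chosen = dags_select(points, predict, labelled_idx=[0], labelled_y=[0.0])
```

Because the two factors are multiplied, a candidate in a sparse region needs a proportionally larger uncertainty to be selected over a candidate in a dense region.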
In order to mitigate concerns that the final results are due to dataset peculiarities, we use a k-fold routine in which we shuffle the design space and divide it into k consecutive folds. We then select k − 1 folds as the training set, while the remaining fold becomes the test set. In our workflow, we set k = 10.
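A minimal sketch of such a k-fold routine is given below, using only the standard library. The function name `k_fold_splits` and the fixed seed are assumptions for illustration; a library routine such as scikit-learn's `KFold` would serve the same purpose.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Shuffle the sample indices, cut them into k consecutive folds, and
    return one (train, test) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more sample
    splits, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_splits(1000, k=10)
```

Each sample appears in exactly one test fold, so every point of the design space contributes once to the reported error.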
As mentioned before, the training set $L_{i+1}$ is built by evaluating the model on the previous training set $L_i$ and then adding the next sample proposed by the selection method. To bootstrap this process, we initialize $L_0$ by randomly selecting 5 samples from the design space. We ensure that, throughout our experiments, the five initially selected random samples remain the same for each dataset. Maintaining this consistency prevents random chance from significantly influencing our results.
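One simple way to keep the initial samples identical across methods is to draw them from a seeded random generator, as in the sketch below. The function name and the seed value are assumptions; the repository may fix the initial samples differently.

```python
import random

def initial_selection(n_pool, n_init=5, seed=42):
    """Draw the same n_init pool indices every time the same seed is used."""
    return random.Random(seed).sample(range(n_pool), n_init)

first = initial_selection(1000)
second = initial_selection(1000)  # identical to `first`, so every method starts equal
```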
The five selection methods are evaluated using the mean absolute error (MAE) metric, which is expressed as follows:
| $\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ | (14) |
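For reference, the MAE can be computed as in the following sketch, which assumes the standard definition (the mean of the absolute differences between true and predicted target values).

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between true and predicted targets."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 0.5, 0.0 and 1.0, so the MAE is 1.5 / 3 = 0.5
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```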
The whole evaluation workflow as described above is performed 10 times. Finally, we plot the average MAE across the 10 experiments for increasing training dataset sizes. The code used for the experiments is openly accessible in our GitHub repository.†
The predictive model being used in our experiments is the XGBoost regressor.38 Details regarding the Python libraries and the hyperparameters of these ML regression models are provided in the SI. For density calculations in the feature space, we used Euclidean distances without applying prior normalization. We acknowledge that omitting normalization can be problematic in very high-dimensional spaces or when feature values differ by several orders of magnitude. In our datasets, however, the number of features is modest (up to 20), and their ranges vary only within a few orders of magnitude. Under these conditions, we chose to focus on demonstrating the impact of incorporating density itself into sampling strategies. Nonetheless, feature normalization remains an important consideration for future work, particularly within a more generalized framework.
The first space is modelled after the 1d Forrester benchmark40 (Fig. 3), which is commonly used for evaluating Bayesian optimization methods. We use it to examine the learning capabilities of the AL frameworks on a continuous yet complex data space, selecting 1000 x samples within the range [0, 1]. The target property is calculated using the following formula:
| $y(x) = (6x - 2)^2 \sin(12x - 4)$ | (15) |
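The Forrester target of eqn (15) can be generated as sketched below. Uniform sampling of x is an assumption that corresponds to the homogeneous version of the space only; how the heterogeneous version is sampled is not reproduced here.

```python
import math
import random

def forrester(x):
    """1d Forrester function: (6x - 2)^2 * sin(12x - 4)."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

rng = random.Random(0)                    # fixed seed, chosen for reproducibility
xs = [rng.random() for _ in range(1000)]  # 1000 samples, uniform in [0, 1)
ys = [forrester(x) for x in xs]
```

As a quick check of the formula, x = 0.5 gives (3 − 2)² sin(6 − 4) = sin(2).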
The next space is a variation of the first, called 1d Jump Forrester (Fig. 4), which inserts a discontinuity into the target function, as we want to capture the effect of non-continuous target properties on the performance of the AL frameworks.
![]() | (16) |
The third space is modelled after a 2d Gaussian (Fig. 5), with $x_1, x_2 \in [-3, 3]$, in order to simulate an area of interest at the center of the space and to evaluate how effectively each method learns the space when the data samples are either uniformly scattered or form an extremely dense area at the center. The target value is produced by:
![]() | (17) |
Finally, the last space has $x_1, x_2 \in [0, 1]$, and y is an exponential function of x (Fig. 6). Complementary to the Gaussian space, this design space has its area of interest at the border, which lets us examine whether a pure exploration AL method performs well in this context. The target y is expressed through:
| $y(x) = 1 - \exp\left(-0.6\left((x_1 - 0.5)^2 + (x_2 - 0.5)^2\right)\right)$ | (18) |
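A direct transcription of eqn (18) shows why the area of interest lies at the border: the target is zero at the center (0.5, 0.5) and grows with the distance from it. The function name is an assumption made for this sketch.

```python
import math

def exponential_target(x1, x2):
    """Target of the exponential space: 1 - exp(-0.6 * squared distance from (0.5, 0.5))."""
    return 1.0 - math.exp(-0.6 * ((x1 - 0.5) ** 2 + (x2 - 0.5) ** 2))

centre = exponential_target(0.5, 0.5)  # 0.0 at the center of the space
corner = exponential_target(0.0, 0.0)  # 1 - exp(-0.3), the largest value on [0, 1]^2
```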
In the following figures, we showcase the performance of various sampling methods, measuring the mean absolute error (MAE) as a function of the number of samples annotated (we designate this number as “# of queries” in the figures, since these annotations are essentially queries towards an “oracle”). The results show that the iGS and DAGS methods outperform random sampling in all homogeneous spaces (Fig. 7(a), 8(a), 9(a) and 10(a)), because they operate in a strategic and exploratory manner, efficiently identifying the most informative data points based on their position in the design space. An important observation is that in homogeneous spaces, where the density of data points is nearly uniform across the entire space, our method effectively reduces to iGS, as the density factor is almost identical for every unknown point.
![]() | ||
| Fig. 7 MAE as a function of the number of queries for Forrester (a) homogeneous and (b) heterogeneous datasets. A lower MAE means better predictive capabilities of the model. | ||
![]() | ||
| Fig. 8 MAE as a function of the number of queries for Jump Forrester (a) homogeneous and (b) heterogeneous datasets. A lower MAE means better predictive capabilities of the model. | ||
![]() | ||
| Fig. 9 MAE as a function of the number of queries for Gaussian (a) homogeneous and (b) heterogeneous datasets. A lower MAE means better predictive capabilities of the model. | ||
![]() | ||
| Fig. 10 MAE as a function of the number of queries for exponential (a) homogeneous and (b) heterogeneous datasets. A lower MAE means better predictive capabilities of the model. | ||
In heterogeneous spaces, however, the performance of AL frameworks compared to random sampling is less straightforward. Notably, AL methods that disregard the density distribution of the design space, such as iGS, often fail to outperform RS. Specifically, iGS exhibits reduced performance in the Forrester space (Fig. 7(b)) when compared to the density-based method, shows nearly identical performance to RS in the Jump Forrester benchmark (Fig. 8(b)), and suffers complete performance degradation in the Gaussian space (Fig. 9(b)). The poor performance of iGS in the Gaussian benchmark can be explained by its tendency to select points far from the center, as it prioritizes coverage of the entire design space. This approach neglects the fact that, in the heterogeneous case, more than half of the data samples are concentrated in the central region, where selecting points is critical for achieving a significant reduction in mean absolute error (MAE). The only heterogeneous design space where iGS performs well is the exponential space (Fig. 10(b)), where it predominantly selects samples from the edges of the space, focusing on modeling the area of interest rather than the central region.
In contrast, the DAGS method performs robustly across both homogeneous and heterogeneous spaces. In homogeneous spaces, it operates in a purely exploratory manner, similar to iGS. In heterogeneous spaces, however, it effectively captures the underlying density distribution of the design space. This adaptability enables our method to consistently outperform RS across all synthetic benchmarks. In rare cases, such as the exponential space (Fig. 10(b)), its performance is comparable to that of iGS, which indicates that in scenarios of extreme heterogeneity, where the area of interest calls for a purely exploratory criterion, the density factor in our method leads to the selection of some suboptimal points. Overall, the density-based AL framework demonstrates superior versatility and effectiveness, making it a more reliable choice for diverse design spaces.41 In all cases we set our query budget N at 150 data points, at which point we stopped the sampling. Of those, 5 were initially selected at random and 145 were selected by each method. For design spaces of 1000 or 2000 samples, 150 queries represent 15% and 7.5% of the whole space, respectively, as we aim to simulate realistic ratios of training set size to design space size in order to test the efficiency of the proposed method.
We utilize five datasets from the literature, each comprising thousands of MOFs characterized by structural and chemical descriptors (or attributes) as input features for model training. These datasets were chosen not only for their size and availability but also because they address the relatively underexplored property of gas diffusivity, as opposed to the more commonly studied sorption capacity or uptake. The target property, diffusivity ($D_i$), typically measured in either m² s⁻¹ or cm² s⁻¹, represents the rate at which penetrants (guest molecules) of species i (commonly gases such as CO2, CH4, N2, and O2) propagate through the porous structure of a material. Target values for diffusivity were obtained through in silico experiments, specifically molecular simulations.
Gas diffusivity is underrepresented in high-throughput simulation schemes due to its higher computational cost relative to sorption properties. This is primarily because calculating diffusivity requires smaller time steps for numerical integration of the equations of motion, resulting in significantly longer simulation times.43 By addressing this property, we aim to highlight the applicability and efficiency of active learning frameworks in domains with high computational complexity.
As with the synthetic datasets, Fig. 11 shows the mean absolute error as a function of the number of samples annotated for the O2 and N2 datasets. A general observation across both datasets (N2 and O2) is that random sampling, despite its simplicity, performs surprisingly well and serves as a challenging competitor for many state-of-the-art methods. Among the tested methods, query-by-committee struggles to perform well, showing higher MAE values throughout. The iGS method performs better than QBC but still lags behind both random sampling and RT, with the latter closely matching the performance of random sampling. The moderate performance of RT can be attributed to its reliance on randomness for sample selection, which assumes that points with similar characteristics (MOFs with similar chemical and structural properties) will exhibit similar diffusion performance. However, this assumption does not appear to hold true for diffusivity.
![]() | ||
| Fig. 11 MAE as a function of the number of queries for (a) N2 and (b) O2 diffusion in MOFs. A lower MAE means better predictive capabilities of the model. | ||
In contrast, the DAGS method demonstrates a clear advantage over all other methods. It not only achieves significantly lower MAEs in the early stages of sampling but also reaches a much lower plateau. For example, with 145 samples, RT and random both achieve a MAE of approximately 0.55 (N2) and 0.45 (O2). In comparison, our approach reaches the same MAE values with only approximately 60 samples (N2) and 90 samples (O2), drastically reducing the number of queries required to achieve similar accuracy by a factor of 2.4 and 1.6, respectively.
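The reduction factors quoted above follow directly from the ratio of query counts, as the short calculation below shows. The helper name `query_reduction` is an assumption; the sample counts are the approximate values read from the text.

```python
def query_reduction(baseline_queries, dags_queries):
    """Factor by which the number of queries is reduced at equal MAE."""
    return baseline_queries / dags_queries

n2_factor = query_reduction(145, 60)  # about 2.4 for the N2 dataset
o2_factor = query_reduction(145, 90)  # about 1.6 for the O2 dataset
```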
This efficiency translates directly to significant time and cost savings in the lab. By requiring fewer annotations to achieve the same prediction accuracy, DAGS minimizes experimental effort and resource use, making it a powerful and practical choice for active learning in materials science applications.
The results for CH4, H2 and He datasets (Fig. 12) largely align with the trends observed in the N2 and O2 datasets, as described previously. Random sampling continues to exhibit strong performance, proving itself as a robust baseline method. Query-by-committee (QBC), however, consistently underperforms, showing the highest MAE values across all datasets. The iGS and RT methods again demonstrate improved performance, with RT closely matching random in most cases. Notably, the He dataset is the only scenario where iGS outperforms all other methods, achieving the lowest MAE, while our density-based method comes second, alongside random.
![]() | ||
| Fig. 12 MAE as a function of the number of queries for (a) CH4, (b) H2 and (c) He diffusion in MOFs. A lower MAE means better predictive capabilities of the model. | ||
For the CH4 and H2 datasets, our method consistently outperforms all other approaches. It achieves a significantly lower MAE in the early stages of sampling and maintains its advantage throughout. For example, for CH4, with 145 samples, RT and random reach a MAE of approximately 0.59 and 0.58, respectively, whereas the DAGS method reaches these MAE values with just 100 and 120 samples, respectively. This corresponds to a reduction in queries by factors of about 1.5 and 1.2. Similarly, for H2, RT and random both achieve final MAEs of 0.41 at 145 samples, while our method reaches this value with only 90 samples, reducing the number of queries performed by a factor of 1.6. This significant reduction in required annotations translates directly to time and cost savings in the lab.
In the He dataset, however, iGS achieves the best performance, highlighting that certain methods may excel in specific scenarios. Nevertheless, the DAGS method still performs competitively, achieving a MAE score of 0.31 at approximately 90 samples, while RT and random achieve the same score with 145 samples and the iGS method reaches it at 65 samples. These results rank DAGS second among the state-of-the-art methods, as it requires 1.6 times fewer queries than RT and random and about 1.4 times more queries than iGS for this specific task. We note that the He dataset has the highest ratio of training size to design space, with the training set covering nearly one quarter of the space. In such cases, a purely explorational method like iGS can afford to query all outliers and still have sufficient budget to cover the denser regions. In smaller design spaces, therefore, incorporating outlier information can improve performance relative to methods that prioritize dense regions. Overall, these findings highlight the robustness and versatility of the density-based approach, especially in datasets with high complexity and imbalanced distributions.
The objective of this work was to highlight the importance of incorporating spatial characteristics, such as heterogeneity, into the selection criterion of AL frameworks and to provide an initial step in this direction. As future work, we propose the development of density metrics for design spaces to better capture and exploit inherent heterogeneity. Such metrics could serve as valuable tools for selecting the most suitable AL framework for a given problem, ultimately improving predictive performance and data efficiency. We also propose further experimentation with our proposed method in new areas of application as well as in real experimental campaigns.
The code for all AL methodologies demonstrated in this work can be found at https://github.com/insane-group/Density_Aware_Greedy_Sampling. The version of the code employed for this study is version 1.0.0. This study was carried out using publicly available data from ibarisorhan/MOF-O2N2 at https://github.com/ibarisorhan/MOF-O2N2/blob/main/mofScripts/MOFdata.csv, for the O2 and N2 diffusion in MOFs cases, and from hdaglar/MOF-basedMMMs_ML at https://github.com/hdaglar/MOF-basedMMMs_ML/blob/main/rawdata.zip, for the CH4, H2 and He diffusion in MOFs cases.
Footnote |
| † https://github.com/insane-group/Density_Aware_Greedy_Sampling. |
| This journal is © the Owner Societies 2025 |