Ignas
Pakamorė
* and
Ross S.
Forgan
WestCHEM, School of Chemistry, University of Glasgow, Joseph Black Building, University Avenue, Glasgow G12 8QQ, UK. E-mail: ignas.pakamore.research@gmail.com
First published on 29th May 2025
Navigating the reaction parameter space can pose challenges, especially considering the exponential growth in the number of parameters even in seemingly straightforward chemical reactions or formulations. Consequently, recent research efforts have been increasingly dedicated to the development of computational tools aimed at facilitating the exploration process. Herein, we introduce ChemSPX, a Python-based program specifically crafted for exploring the complex landscape of reaction parameter space. We propose the use of the inverse distance function to map reaction parameter space and efficiently sample sparse regions. This is implemented in ChemSPX to allow the user to simply generate sets of reaction conditions that efficiently sample wide parameter spaces. In addition, the program includes tools necessary for the analysis and comprehension of the multidimensional parameter space landscape. The developed algorithms were utilized to experimentally investigate the hydrolysis of N,N-dimethylformamide (DMF), a commonly employed solvent, in the specific context of metal–organic framework synthesis. We use ChemSPX to generate batches of experiments to sample parameter space, starting from an empty space, but subsequently assessing under-sampled regions. We use statistical analysis and machine learning models to show that addition of strong acids induces hydrolysis, generating up to 1.0% (w/w) formic acid. The results show that ChemSPX can generate datasets that efficiently sample parameter space, in this case allowing the user to distinguish the individual effects of five different physical and chemical variables on reaction outcome.
Over the past decade, algorithms for exploring reaction parameter space have emerged, encompassing machine learning-driven approaches or various alternative strategies.15,18,19 Machine learning algorithms thrive with ample data, but the challenge lies in acquiring and ensuring the quality of such datasets. For example, literature-based synthesis databases suffer from inherent biases, where published research is not representative of all studies.20 This bias arises from factors such as the fear of journal rejection for negative results. Additionally, the presence of fraudulent manuscripts, as seen in cases of counterfeit publications in the field of metal–organic frameworks (MOFs), raises concerns about the reliability of data.21,22 Anthropogenic biases, such as heuristics and social influences in chemical reactions, further impact data-driven planning efforts, hindering the objectivity of literature-based synthesis databases.23 Additionally, compiling such extensive databases can be complex and, at times, seemingly impossible.
In this study, we introduce the ChemSPX program, which facilitates the analysis and exploration of a predefined reaction parameter space autonomously, without dependence on prior knowledge or experimental outcomes. The search algorithm employs an inverse distance function to maximise the separation between sampled reaction condition vectors in n-dimensional parameter space, thereby enhancing exploration. The algorithm aims to methodically navigate through the parameter space, sampling a minimum number of unique reaction formulations that have not been explored before. Sequential sampling is focused on sparse regions of the parameter space with the highest potential for yielding novel and unexplored reaction conditions. This strategic approach aims to efficiently provide human researchers with a feasible number of experiments, streamlining the experimental process and maximising the information gained from each iteration. The primary motivation behind developing the ChemSPX program is to create a user-centric application that aids researchers working on small or medium-scale projects, especially in the absence of experiment automation.
We demonstrate the application of ChemSPX code by exploring the parameter space of N,N-dimethylformamide (DMF) hydrolysis. DMF is a widely used aprotic solvent with a high boiling point (152.85 °C) and is miscible with water and other solvents.24,25 Its versatile physical and chemical properties make it a common choice in both small-scale and industrial chemical processes; it has been widely applied to the synthesis of MOFs, for example. A key characteristic of DMF is its decomposition under certain conditions. Upon heating, DMF undergoes decarbonylation, producing carbon monoxide (CO) and dimethylamine (DMA).26 Additionally, in the presence of water, DMF undergoes gradual hydrolysis, resulting in the formation of formic acid (FA) and DMA.27,28 Therefore, these two byproducts, DMA in particular, serve as predominant impurities in DMF, contributing to its distinct amine odour.
Aqueous DMF hydrolysis can be accelerated with an acid or base catalyst.29–31 MOF syntheses typically employ neat DMF, but commonly used additives like acidic modulators and co-solvents, such as water,32 are expected to promote DMF hydrolysis. The resulting products play crucial roles in influencing the kinetics of MOF formation: FA acts as a modulator and/or a source of protons, while DMA serves as a Brønsted base, deprotonating carboxylate ligands.33 The slow release of base upon gentle heating allows pH control in solvothermal reactions, which are widely utilised in synthesising various MOF materials, for example, the well-known MOF-5.34
The hydrolysis mechanism of DMF under acidic conditions has been thoroughly studied and understood.30,35–37 Typically, DMF hydrolysis is qualitatively assessed in the literature through NMR spectroscopy. However, only a limited number of studies have specifically tackled the quantification of the generated products, formic acid and dimethylamine.38,39 The complexity of this reaction, influenced by numerous parameters, necessitates a comprehensive investigation to identify optimum conditions that minimise byproduct yield. Herein, we exemplify the power of the ChemSPX program to analyse and sample parameter space in the hydrolysis of DMF under conditions relevant to MOF synthesis.
The diverse reaction conditions can be visualized in an n-dimensional parameter space through multidimensional vectors. The mathematical distance between vectors (eqn (1)), signifies the difference between two sets of reaction conditions. This distance metric quantifies the dissimilarity or separation between the multidimensional vectors , providing a measure of how distinct or similar the corresponding reaction conditions are.
![]() | (1) |
Therefore, the probability of two reaction conditions producing identical results is proportional to the inverse of the distance between their respective multidimensional vectors (eqn (2)).
![]() | (2) |
This mathematical relationship underscores the concept that closely positioned points in the parameter space are associated with more comparable reaction conditions, while greater distances suggest greater dissimilarity or divergence in the outcomes. Therefore, to select reaction conditions from the parameter space with the lowest likelihood of producing identical or similar outcomes, it is crucial to maximise the spatial distance between the corresponding vectors.
![]() | (3) |
Applying this concept, we sample new reaction conditions from regions characterized by the lowest ϕ values, which indicate the highest probability of producing diverse reaction outcomes.
The magnitude of the inverse distance measure is determined by two variables: the number of nearest neighbours, considered N; and the exponent, b. Adjusting these two parameters alters the distribution landscape of ϕ in the parameter space (Fig. 1a).
The reciprocal power b serves as a critical factor in determining the proximity threshold between the reaction vectors. As the reciprocal power increases, the sharpness of the change in ϕ values at low distances diminish (Fig. 1b). This implies that at higher values of b, reaction condition vectors can be sampled at closer proximities. Furthermore, increasing the number of nearest neighbours N results in a reduction of ϕ values (Fig. 1c). In other words, as N increases, the impact of nearby neighbours on the inverse distance function ϕ diminishes. Therefore, to evaluate unexplored local regions effectively, careful tuning of the N parameter is essential. Overall, while reciprocal power allows control over the proximity of vectors, the number of nearest neighbours steers ϕ towards local minima in the parameter space.
B = [A − Aχ,A + Aχ] | (4) |
After constructing the subspace hyperrectangle, an optimisation algorithm is employed to minimise ϕ within the vicinity of B and identify a new vector with the minimum ϕ value.
The implemented optimization algorithm, as illustrated in Fig. 2, iteratively focuses on optimizing the coordinates of individual vectors within the sampled set. By systematically refining each vector's coordinates, the algorithm gradually improves the overall distribution and convergence of the sampled data points, leading to enhanced representation and effectiveness within the parameter space. At each algorithm iteration step, the average ensemble properties (denoted in 〈ϕ〉) of sample parameter vectors are calculated.
The success of the search of sparse regions is defined by three convergence criteria: the average inverse distance function; the average change in inverse distance function; and the average sample vector change (eqn (5)–(7)).
![]() | (5) |
![]() | (6) |
![]() | (7) |
The vector change is determined by computing the difference between vectors at iteration i and i + 1 and expressed as the modulus of the resultant (eqn (8)).
![]() | (8) |
This parameter enables the assessment of whether the vector has converged in specific coordinates, and hence if the algorithm can be terminated. The convergence of the vector occurs when its coordinates either remain unchanged or exhibit minimal changes between successive iteration steps. It is also possible to use ChemSPX to search and optimise an existing parameter space that has been populated manually or by alternative computational methods. When employing a pre-populated parameter space for analysis and sampling, the existing reaction vectors remain static while the sampled vectors are optimised in relation to each other and the static vectors. This optimization process aims to enhance the coherence and distribution of the sampled vectors within the parameter space, ensuring that they align effectively with the existing data points while also maintaining appropriate spacing and coverage across the entire space.
The plots presented in Fig. 3, illustrating the minimization of the test functions, reveal the significant influence of the step size parameter χ on the convergence behaviour. In particular, for the Ackley function – known for its complex landscape with numerous local minima – the Genetic Algorithm optimizer exhibits difficulty in locating the global minimum when smaller χ values are used. In contrast, for the Matya's and Himmelblau functions, convergence to the global minimum is consistently achieved when the step size is greater than or equal to 0.1. These observations highlight the critical role of appropriately tuning the χ parameter to guide the optimization process toward either local or global convergence.
![]() | (9) |
The new sampled reaction vector is the centre point Z of a hypersphere S, encompassing the smallest number of reaction vectors A. In the consecutive sampling, all parameter space is explored with new data points considered. The search radius can be manually controlled and is defined by eqn (10):
![]() | (10) |
Zi = {z0,z1,…,zn} | (11) |
The radius of the hypersphere can be tuned based on the previously discussed vector change factor (eqn (8)). The size of the defined hypersphere affords control over the regions that are explored within the parameter space. The more vacant regions will fit a larger hypersphere and vice versa.
When compared to the aforementioned inverse distance function method, the void search algorithm exhibits a higher level of stochasticity. As the initial positions for the genetic algorithm are randomly determined, there is no guarantee of convergence to identical coordinates between iterations. Therefore, the void search algorithm proves effective in discovering unsampled regions while retaining a stochastic nature that reduces bias in parameter selection. The void search algorithm can be seen as an extension of the inverse distance function, facilitating the rapid identification of large unpopulated parameter space regions.
The DMF hydrolysis parameter space was sampled using the developed ChemSPX package. Two sampling methods were employed to search the parameter space: discrete and uniform (outlined in Table 1). For experimental simplicity, the DMF volume was kept constant. For the time parameter, a mixed sampling method was used, obtaining data in short (1–12 h, uniform) and long (24–168 h, discrete) periods. Additionally, a uniform method was applied to sample continuous parameters of moles of an acid catalyst and water volume. Taking all parameters into account, there are approximately 672000 possible reaction conditions. Hence, in silico reaction condition sampling proves instrumental for a more efficient and focused examination of the parameter space.
The reaction formulations were systematically sampled within the specified bounds, as outlined in Table 1. The complete reaction parameter space sampling was executed in five batches (Table 2). Variations in batch sizes were employed to track improvements in both the correlation between generated formic acid and reaction parameters, as well as enhancements in machine learning model performance (vide infra). The first batch of 50 experiments were selected using LHS and kept exempt from equilibration by the inverse distance function, thus creating a pre-populated parameter space. To exemplify the features of ChemSPX, the subsequent batches 2–5 were samples by both the LHS method (batches 2 and 3) and the void search algorithm (batches 4 and 5). Every successive set of reaction conditions generated by ChemSPX was added to the pre-existing reference dataset, thereby blocking already explored regions. The initial formulations acquired in batches 2–5 were refined through additional iterations utilising the inverse distance function ϕ, aiming to identify its minimum. The ϕ values for individual data points were calculated by considering their four nearest neighbours. The reciprocal power b was set to 1 aiming to maintain a larger distance between sampled reaction parameter vectors. A genetic algorithm optimiser, configured with optimisation step size χ set to 0.01, was employed to locate the minima of the ϕ function. Genetic algorithm optimizations consisted of 80 cycles with a population size of 100, while other parameters were maintained at default program settings (see ESI, Section S3†). A total of 120 inverse distance function optimisation cycles were performed, with ϕ, Δϕ, and parameters achieving convergence.
Batch | Size | ChemSPX initial sampling method | ϕ minimisation applied |
---|---|---|---|
1 | 50 | LHS | No |
2 | 20 | LHS | Yes |
3 | 21 | LHS | Yes |
4 | 21 | Void | Yes |
5 | 40 | Void | Yes |
To evaluate the extent of exploration within the defined parameter space, we employ the Monte Carlo integration algorithm (refer to ESI, Section S8† for detailed methodology). The computed exploration fraction quantifies how much of the parameter space has been covered, varying between 0 (entirely unexplored) and 1 (fully explored). In the consecutive sampling of the parameter space, from batches 1 to 3, the fraction of the explored parameter space increases sharply to 0.55 with 91 sampled reaction conditions (Fig. 4). Furthermore, the exploration rate drops sharply with the addition of an extra 61 reaction formulations. These observations indicate that most of the reaction parameter space is explored with batches 1 to 3. The reduced exploration rate implies that beyond batch 4, there is a transition from exploration to exploitation of the parameter space. The Monte Carlo calculations reveal that 57% of the parameter space has been explored and showcase the effectiveness of the ChemSPX algorithm in uncovering under-sampled regions of the parameter space.
![]() | (12) |
Each experiment was carried out in triplicate to obtain the standard deviation of the measurement.
Prior to analysis of the data, a systematic assessment of the impact of 1H NMR spectroscopic measurement errors on the entirety of our experimental data was carried out. The analysis procedure demanded a meticulous level of precision in both the preparation of the deuterated solvent containing the standard reference (TMS) and the analyte. Despite the potential for errors during sample preparation and measurement, the error distribution across the entire dataset reveals a remarkably high level of measurement accuracy. As depicted in Fig. 5a, a significant proportion of the samples exhibit low standard deviation values, underscoring the precision of our measurements.
For a more nuanced measurement error analysis, we computed fractional errors by dividing standard deviation values by the average % FA values of a single experiment (Fig. 5b). The obtained data revealed a discernible trend: as the % FA amount in the measured samples decreases, fractional errors exhibit an increasing pattern. Notably, most samples demonstrate fractional values around 0.2, indicating a higher level of accuracy (Fig. 5c). However, a small subset of samples, found in the lower % FA range, exhibit higher fractional error values due to limitations in the detection of small quantities of material by 1H NMR spectroscopy. This observation is attributed to peak merging, coupled with low signal intensities, leading to reduced accuracy in peak integration (see ESI Section S8†). Importantly, these samples constitute a minor proportion of the overall dataset: approximately 5%.
The influence of each batch, and consequently the sample size, on the overall dataset was assessed using Pearson's R correlation measure. The extended Pearson's correlation matrix and parameter pair plot can be found in the ESI, Section S9.† Initially, it was noted that the water parameter exhibited a relatively low correlation coefficient, which was counterintuitive. Further examination of the data revealed the need to account for water originating from aqueous acid solutions, which was not captured by the initial parameter space generation. Corrections were specifically applied to reactions involving 37% HCl and 70% HNO3 acid catalysts to account for the water in these solutions. The post-correction correlation analysis on the overall data demonstrated a significant improvement in the R factor, increasing from 0.16 to 0.37 for all data.
SHAP analysis was conducted to deepen our understanding of the resultant model. This method provides insights into feature contribution towards the machine learning model predictions, offering a nuanced understanding of its behaviour.42 Positive SHAP values indicate a feature's contribution to increasing the prediction output, negative values indicate a contribution to decreasing the prediction output, and values close to zero signify minimal influence on the prediction. SHAP values correlate with real processes by revealing the impact magnitude of each feature on the model's predictions, helping to understand how changes in input variables affect the outcome, which can align with actual mechanisms or behaviours observed in the real-world process.6
Upon analysing the data, it becomes evident that the concentration of acid moles and the pKa value emerge as pivotal factors shaping the predictions (Fig. 6b). In contrast, water volume, temperature, and duration exhibit diminished significance in influencing the outcomes. These observations align with findings from Pearson's correlation analysis (see ESI, Section S9†), which indicated significant correlations between acid moles and pKa values and the percentage of formic acid (% FA). In contrast, water volume, temperature, and time features exhibit a relatively minor influence on the model's predictions. Notably, the removal of these parameters does not alter the accuracy of the model, underscoring their limited significance in the predictive framework.
Furthermore, the calculated SHAP values were analysed for each model input or feature individually (Fig. 6c and d; other plots are given in ESI, Section S10†). As anticipated, all features demonstrate consistency with the correlations observed between the generated formic acid and the various reaction parameters. In addition, the obtained LightGBM regression model was utilised to predict the distribution of formic acid in the parameter space. By evaluating the determined principal parameters, discrete regions can be identified for estimating the yield of % FA (see ESI, Section S10†). Again, the ability to unveil these subtle effects of modifying specific variables on the outcome of the DMF hydrolysis reaction underpin the utility of ChemSPX in efficiently searching parameter space.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00200a |
This journal is © The Royal Society of Chemistry 2025 |