 Open Access Article
 Open Access Article
      
        
          
            Jordi 
            Buils
          
        
       ab, 
      
        
          
            Diego 
            Garay-Ruiz
ab, 
      
        
          
            Diego 
            Garay-Ruiz
          
        
       *a, 
      
        
          
            Mireia 
            Segado-Centellas
*a, 
      
        
          
            Mireia 
            Segado-Centellas
          
        
       ab, 
      
        
          
            Enric 
            Petrus
ab, 
      
        
          
            Enric 
            Petrus
          
        
       ac and 
      
        
          
            Carles 
            Bo
ac and 
      
        
          
            Carles 
            Bo
          
        
       *ab
*ab
      
aInstitute of Chemical Research of Catalonia (ICIQ-CERCA), The Barcelona Institute of Science and Technology, Av. Països Catalans 16, 43007 Tarragona, Spain. E-mail: cbo@iciq.cat; dgaray@iciq.es
      
bDepartament de Química Física i Química Inorgànica, Universitat Rovira i Virgili (URV), Marcel·lí Domingo, 43007 Tarragona, Spain
      
cEawag: Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600 Dübendorf, Switzerland
    
First published on 27th July 2024
Herein, we present a new computational methodology that unlocks the prediction of the complex multi-species multi-equilibria processes involved in the formation of complex metal-oxo nanoclusters. Relying on our recently introduced method named POMSimulator, we extended its capabilities and challenged its accuracy with the well-known phosphomolybdate [PMo12O40]3− Keggin anion system. We show how the use of statistical techniques enabled the processing of a vast number of speciation models and their associated systems of non-linear equations efficiently and in a scalable manner. Subsequently, this approach is applied to generate statistically averaged speciation diagrams and their associated error bars. Then, we unveil the previously unreported speciation phase diagram under varying [Mo]/[P] ratios vs. pH. Our findings align well with experimental data, indicating the prevalence of the Keggin {PMo12} as the primary species at low pH, but the lacunary {PMo11}and Strandberg {P2Mo5} anions also emerge as major species at other concentration ratios. Finally, from 7 × 104 speciation models we inferred a plausible reaction network across the diverse nuclearities present within the system, which underlines the role of trimers as key intermediate building blocks.
Polyoxometalates form a wide range of well-defined structures of different sizes and shapes. The self-assembly formation processes of these structures depend on different factors such as pH, temperature, pressure, total metal concentration, ionic force, and the presence of reducing agents and counter-ions. Despite the complexity of controlling the synthesis, POMs are finding relevant applications in the fields of catalysis,5–9 electrochemistry,10 medicine,11–14 and information technologies.15–19 Mass spectrometry,20 X-ray diffraction,21,22 and NMR23 are the most important techniques used experimentally to determine POMs structures. On the other hand, quantum mechanics methods and molecular simulations have provided essential insight for understanding POMs chemistry, their electronic structure and reactivity, and their properties in solution.24 Yet, none of these techniques have described in detail the complex multi-species multi-equilibria processes that form polyoxometalates.
We recently presented a new computational method25–28 named POMSimulator that automatically generates and solves the multi-equilibria non-linear system of equations (NLE) for a given set of molecular oxo-clusters. POMSimulator computes the concentrations of all species at equilibrium, so it allows plotting speciation diagrams (conc. vs. pH) and speciation phase diagrams (total metal conc. vs. pH) from first principles calculations. This methodology was successfully employed to describe the speciation of Mo and W,25,26 and of V, Nb, and Ta27 isopolyanions (IPAs). In all these cases, our method resulted in an excellent agreement with experimental data. POMSimulator steadily evolved since its creation aimed at dealing with increasingly complex chemical systems. Several algorithmic improvements in the deduction of the nucleation mechanisms, generation of formation constants, and parallelization of code led to a 20× speed-up in the NLE resolution step. Recently, we have released a public open-source version of the code.29,30
However, the main issue that limits the general application of our method is that, in such kinds of systems, there are indeed many more reactions than chemical compounds, and thus the resulting system of NLE is overdetermined. To tackle this problem, we introduce the concept of speciation model (SM): a unique subset of chemical reactions and a mass balance equation matching the number of compounds to produce a determined NLE system, as schematically shown in Fig. 1. Consequently, a SM describes the composition of the system at equilibrium. To define SMs, we assume that all protonation reactions in the set must be included (due to the importance of the acid–base behaviour of POMs), with only nucleation reactions varying across models. In this way, we can strongly reduce the total number of produced SMs: for the example in Fig. 1, we go from a total of  models, to only 6 (combining the 4 acid–base reactions and the mass balance with one nucleation reaction at a time). Then, all possible SMs were sorted according to their root mean squared error to the experimental data available, and finally, the single best speciation model was selected. This SM was then used to represent the system, employing its regression parameters to scale all computed equilibrium constants.
 models, to only 6 (combining the 4 acid–base reactions and the mass balance with one nucleation reaction at a time). Then, all possible SMs were sorted according to their root mean squared error to the experimental data available, and finally, the single best speciation model was selected. This SM was then used to represent the system, employing its regression parameters to scale all computed equilibrium constants.
Although that procedure worked well for the simplest metal-oxo IPA clusters, the inherent dependence on experimental formation constants supposes an important drawback in the predictive power of the method. In most cases, neither full speciation diagrams nor datasets of formation constants are available. Instead, there is only qualitative data about the species which are formed, or the pH at which these appear.31,32 Moreover, the problem becomes untreatable as complexity increases, since the number of speciation models grows factorially with the number of species and reactions.
Aimed at broadening the applicability of the method and reducing its close dependence on experimental data, we herein present new developments that establish a basis for exploring huge chemical speciation spaces, guided by stochastic sampling and statistical analysis. First, we introduce the concept of SM ensembles with comparable behaviour and how to select representative models. This new approach allows simulating average speciation diagrams and their associated error bars without relying on experimental data. Then, we show how the new methodology can handle the overwhelming complexity arising from the presence of heteroatoms in the POM structure. Pursuing a better understanding of the mechanisms involved in the formation of the Keggin anion, we chose this well-known phosphomolybdate system to challenge our method. For the first time, computed speciation diagrams have enabled the identification of certain species detected in experiments but not yet fully characterized. Considering the presence of species containing both phosphorus and molybdenum, we offer two perspectives of the speciation phase diagram: one emphasizing the phosphorus percentages within the species and the other focusing on the molybdenum percentages. Moreover, the analysis of the reaction networks embedded into tens of thousands of speciation models permitted the identification of the most relevant reaction mechanisms in play.
The notion of SM ensembles has been introduced to consider the variability of the NLE systems that are under treatment: as the nucleation reactions selected among different SMs vary, the speciation predicted by every model can be dramatically distinct. It is noteworthy that small numerical changes in the equilibrium constants can lead to completely different speciation scenarios. Since assessing descriptors to categorize and compare SMs is not trivial, herein we decided on an approach based on identifying the most relevant features of the speciation diagram computed for every SM. The advantage of such an approach is that the speciation can be immediately referenced to experimental information, even when only qualitative data is available.
Because of the high variance in the speciation results, a naïve average of a complete collection of speciation diagrams would be unlikely to reflect well the actual behaviour of the system. In contrast, we pursue to apply a clustering approach over the collection of models to encounter the groups with the most distinct speciation behaviours. Then, we average only inside these groups, unravelling the key typologies that can be extracted from the data. Moreover, we also have access to the standard deviation of each group, which we can use to estimate the uncertainty associated with each of our speciation predictions.
As shown in Fig. 2, the first step of the workflow is the featurization of the speciation diagrams, thus reducing the dimensionality. The collection of all the speciation diagrams is stored as a 3D array of size Nspecies·NpH·Nmodels. In general, the number of pH points (NpH) shall be relatively large, as the resolution of the NLE systems encoded in each model does not behave well for sparse pH grids. The goal of the featurization is to reduce the dimensionality along the pH axis, characterizing the speciation diagrams through a set of parameters related to the shape and location of each peak for each species: maximum height, width, area, and position of the maximum. After testing several feature subsets (see Fig. S8–S11 in the ESI†), we encountered that considering the height, width, and position of the peaks was enough to reproduce the behaviour of the whole diagram across the pipeline, reducing the input to a 2D matrix of shape (3Nspecies·Nmodels).
For the clustering stage, we selected an unsupervised K-means algorithm to group the SMs, applying also Principal Component Analysis (PCA) to visualize the spread of the detected clusters. While alternative approaches could be proposed (e.g. t-SNE, DBSCAN, autoencoders…), K-means was the most size-scalable method, which was essential to tackle large systems having more than 1 × 106 SMs. From these clusters, a chemistry-driven selection is performed, discarding the groups having predictions that are off from experimental results or chemical knowledge. The criterion for discarding a group could be, for example, related to the presence of large molar fractions for species that are not reported, or just by the appearance of peaks away from the expected pH. For instance, if we compare the second group with the experimental reference from Fig. 3, we find that the central peak is overestimated, and the shapes of the extreme peaks are not well-reproduced. Therefore, in this situation, we will select the first group to describe the speciation of the system. Although the number of desired clusters in K-Means must be specified beforehand, the proposed selection scheme which regroups all the clusters that have not been discarded makes the tuning of this parameter less critical.
Further refinement inside a given group of models can be achieved either through applying the clustering protocol a second time, if the number of models it contains is large enough, or else by filtering out its most extreme observations (outlier filtering). To achieve this (bottom part of Fig. 2), we select a given species in the diagram to be adjusted and then build the corresponding box and whisker plots for its molar fraction values at every pH point or a selected subrange of pH points. From there, the points that are further than 1.5 times the interquartile range of the sample (beyond the limits of the whiskers) can be discarded, consequently refining the description of the target species in the speciation. Then, the average speciation diagram is computed for each of these clusters, including error bands from the standard deviation (considering a ± σ interval from the average value).
In the original POMSimulator workflow, we generated speciation diagrams, speciation phase diagrams, and reaction mechanisms for the best SM, selected from the RMSE values obtained through linear regression against experimental data. In contrast, the novel approach uses a selected group of SMs rather than one single best model. In this way, we are now considering the diversity of different groups of SMs and the similarities among the models of a given group or groups.
Another important point to discuss is that the raw DFT computed formation constants are overestimated,25–27 so they need to be scaled, mainly due to limitations associated with the modelling of acid–base reactions and the corresponding solvation effects. As we did previously, and since experimental formation constants for this system were available,31,33,34 they can be employed to scale the overestimated DFT formation constants computed by our methodology. However, herein we rely on a randomly chosen SMs set, so looking for the best SM would not make sense. Instead, we propose to use the average slope and intercept values from the whole SMs approximately 3 × 106, avoiding the choice of a unique representative model. In this way, the treatment becomes more general and applicable to cases where random sampling is applied, leveraging the large set of models that the POMSimulator generates. This resonates well with previous findings hinting at a possible universal scaling for formation constants.28 As detailed in the ESI (Fig. S1),† the employed regression equation for all reported results was log![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) Kexp = 0.28
Kexp = 0.28![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) KDFT − 2.02. This equation showcases the overestimation of the theoretical formation constants, which have to be scaled by a factor of 0.28, and then subtracted 2.02 logarithmic units.
KDFT − 2.02. This equation showcases the overestimation of the theoretical formation constants, which have to be scaled by a factor of 0.28, and then subtracted 2.02 logarithmic units.
To validate this approach, we selected three of the systems that we had already successfully explored: W, Nb, and V IPAs. As shown in Table 1, these selected systems enable us to check the adequacy of our clustering approach for widely different numbers of SMs, to test how well the process scales, and for increasingly complex speciation diagrams. Details on the protocol followed for each of these systems, as well as the comparison between the predicted speciation diagrams and experimental results, are available in the Fig. S2–S7 in the ESI.† Nonetheless, as a general note, all three examples resulted in reasonable agreements between the clustering-based and the best-model-based methods, confirming the consistency of these two modes of operation. As we have proven from the W, Nb, and V systems, using sub-samples is enough to simulate the complex speciation of POMs. At this point, we were confident that the results obtained from the chosen sample (1%) were representative from a statistical point of view.
| Metal | Nspecies | Nreactions | Number of SMs | Number selected SMs | Acid/base behaviour | 
|---|---|---|---|---|---|
| W | 51 | 67 | 50k | 1051 | Acidic | 
| Nb | 39 | 66 | 500k | 6642 | Alkaline | 
| V | 42 | 75 | 1M | 12 ![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 971 | Amphoteric | 
| PMo | 49 | 109 | 300M | 25 ![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 761 | Acidic | 
|  | ||
| Fig. 4  (a) Experimental speciation diagram for phosphomolybdate system (adapted from Cadot et al.33); (b) Simulated speciation diagram from 25 ![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 761 selected SMs. Lines represent the average value of concentration, and the shading represents the deviation from that value; (c) simulated speciation diagram grouped by nuclearities. For a same nuclearity all protonation states were added into a single line. | ||
Our workflow starts indeed by defining a set of building blocks that, in the present case, comprises 49 chemical species ranging from phosphate [PO4]3− and molybdate [MoO4]2− monomers, to dimers, trimers, etc. until the {PMo12} species itself, at several protonation states (see Computational Details). For the Keggin anion, three species [PMo12O40]3-, [HPMo12O40]2− and [H2PMo12O40]− were included while larger species such as the Wells–Dawson {P2Mo18} were excluded to reduce complexity and enable comparison with recent experimental data.33 From there, we solved the NLE systems of the 3 × 106 randomly sampled SMs (1% of 3 × 108) for an array of pH values, and constructed the corresponding speciation diagram of every SM. After applying the statistical treatment described above, we ended up with a total of 2.5 × 104 SMs that were in reasonable agreement with experimental evidences.31–34 Finally, we simulated an average speciation diagram (Fig. 4b), representing the amount of each species at equilibrium expressed as their phosphate fraction, with their corresponding error bars (shaded areas), versus pH.
At first glance, comparing the experimental speciation diagram (Fig. 4a) with the simulated one (Fig. 4b) is not entirely satisfactory, as the protonation state of the various species suggested by the experiments does not always align precisely with the species predicted by the simulation. Note that for some species, the shadow areas in Fig. 4b, indicating the standard deviation of each curve, is very narrow thus the curve predicted is quite accurate. We can present the same results with much better agreement with the experimental results if we group species of the same nuclearity in the speciation diagram, as shown in Fig. 4c. As expected, the assembly of phosphomolybdates happens only at acidic pH: in alkaline conditions (pH > 7), only the acid–base processes of monomers are observed, in our case phosphate in Fig. 4. We predict the same main species as in the experiments: the Keggin anion {PMo12}, non-protonated or mono-protonated, appears as the major species at pH < 2. The lacunary {PMo11} is major species in the experimental diagram in the range 2<pH < 4, and it also appears in Fig. 4c although with lower intensity. Remarkably, the Strandberg anion {P2Mo5} also emerges in the simulated diagram but slightly displaced towards more acidic conditions. Additionally, we also found the two {PMo9} lacunary structures reported by Pettersson and by Cadot, although at lower concentrations. B-{PMo9} is the pink-yellowish peak centred at pH = 4 so it corresponds to [PMo9O34]9− species. Then, B-{PMo9} is [PMo9O31]3−, which appears at very low pH and as a minor species.
In general, we can see how as pH increases there is an important degree of disassembly, going from the most metal-rich species {PMo12} to smaller clusters. This evidence is explained because the self-assembly of phosphomolybdates is promoted by the presence of protons. We were also able to identify {PMo5}, the precursor of {P2Mo5}, which was previously unreported. The robustness of this approach is indeed confirmed by the presence of the key {P2Mo5} cluster, which was absent when applying the previous methodology that only considered the lowest-RMSE best model in the dataset (Fig. S11 in the ESI†). Even though the maximum concentrations do not fully match the experimental values, we predict the same dominant species in the same relative positions. This mismatch is a consequence of considering an average value across multiple speciation models, as explained above. Therefore, if we consider the standard deviation for each species, we obtain an error band that is closer to the expected values. Moreover, the Keggin anion is in any case predominant in the 0 < pH < 2 range, as expected. It is worth noting that all the speciation diagrams presented in Fig. 4 correspond to experimental conditions ([Mo]/[P] = 12) that favour the formation of the Keggin anion.
Given the success of the simulation of the PMo speciation diagram, the next step was the computation of the corresponding speciation phase diagram (see Fig. 5). Note that for generating speciation phase diagrams for IPAs, it was necessary to calculate the speciation for the single best SM for all pH points at different values of total metal concentration. When applying the same methodology to HPAs, however, there is a fundamental change. As HPAs contain both metal and heteroatom, the speciation is strongly dependent on the ratio between them. For this reason, the key parameter for the Y-axis of the speciation phase diagram was the [Mo]/[P] ratio, instead of the total metal concentration in the system. Also, under the current paradigm, the treatment of speciation ensembles required us to consider more than one SM to build the phase diagram.
We selected from the statistical treatment the SMs which were known to provide a good prediction of the speciation at a metal/heteroatom ratio of 12 (Fig. 4). From there, we took the same group of SMs to compute the average speciation phase diagram through a grid of 20 points for the [Mo]/[P] ratio (ranging from 2 to 12), and 280 points for the pH. As both metal and heteroatom percentages can be considered, we plotted two complementary speciation phase diagrams, employing the phosphate fraction (Fig. 5 above) and the molybdate fraction (Fig. 5 below). In this way, it is possible to analyse the preferred phases for phosphorus and molybdenum at any pH-ratio point and separately.
Up to pH = 2 and at a large [Mo]/[P] ratio (>7), the Keggin anion is the major phase for both P and Mo-containing species. However, if the relative amount of Mo decreases, free phosphoric acid becomes the dominant form of phosphorus. As pH increases, the Keggin-lacunary anion {PMo11} becomes dominant, appearing at large ratios for Mo and P. However, {PMo11} is competing with the Strandberg anion {P2Mo5}, which becomes the main phase, both at lower [Mo]/[P] ratios and at 2 < pH < 3. As aforementioned, this species was not predicted by the single best SM with the lowest RMSE. Therefore, its central role in the predicted phase diagrams is another confirmation of the adequacy of our approach. Above pH > 3.5, all dominant phosphorus species are phosphates in decreasing states of protonation. Molybdenum, in contrast, is present as the Mo2O7 dimer as another main species until pH = 5, where the molybdate [MoO4]2− becomes the only major species.
Originally, the reaction mechanism for the self-assembly of IPAs was acquired from the selection of a single best SM, which consists of a single set of reactions. Following the previous discussion, we took the same selected SMs group as in the speciation and phase diagrams to analyse the chemical reactions. For each nuclearity in the molecular set, we looked for all the reactions that formed it and calculated the frequency of each reaction appearing in the selected SMs group. From there we chose the reactions with the highest frequency to represent the formation of each nuclearity, as depicted in Fig. 6. Interestingly, the reactions for {Mo2} 1, {PMo9} 11, and {P2Mo5} 15 presented particularly high frequencies, hinting at their importance in the mechanism. In general terms, the preferred reaction was not the thermodynamically most favourable. Condensation reactions that allow growing larger clusters are driven forward by the complexity of the reaction network and the coupling of nucleation and acid–base equilibria, as already discussed in a previous work.27 This highlights the need for our approach, which includes acid–base equilibrium and condensation and addition reactions, to describe the reaction mechanism for the formation of such nanoclusters. A standard linear description of potential energy curves with only the most stable species would not adequately depict the pathway of cluster formation.
Regarding the formation of the Keggin anion 14, common chemical intuition assumed that trimers should play a key role.35,36 Simple visual analysis of Fig. 6 reveals that the trimeric species 3, 4 and 5 are highly interconnected. Particularly, the {Mo3} 3 that governs most of the cascade of reactions leading to Keggin through three consecutive additions: with phosphate 7 to lead {PMo3} 8, then with 8 towards {PMo6} 10, and with 10 to give {PMo9} 11. Finally, the resulting {PMo9} 11 reacts with the linear trimeric species {Mo3O10} 4 and forms the Keggin anion. Alternatively, the formation of the Keggin-lacunary 13 is achieved by the reaction of the {PMo9} 11 with a dimeric species 2. Finally, the other main compound {P2Mo5} 15 is formed after one {PMo3} 8 reacts with a dimeric species 1 to give the {PMo5} 9, which can react with another phosphate 7 to reach the Strandberg anion.
Noteworthily, the reaction network also shows the distinct nature and role of the two {PMo9} species. A-{PMo9} 11 is a highly connected node, with degree equal to 4, and is the intermediary step in the pathway towards lacunary 13 and Keggin 14. However, the other B-{PMo9} species, designated as 12 in the reaction network, exhibits degree two akin to species 13, 14, and 15. This indicates that all these species serve as endpoints in the network, representing potential reaction products achievable under favourable conditions.
The topological properties of the reaction network arising from our massive numerical analysis places a reactant at the same level as products. Actually, phosphate anion 7 is the only other node with degree 2 in the network because it relates to very few species, contrarily to the other reactants molybdate 0 or other Mo species having a larger degree. The network shows that the heteroatom P is indeed incorporated into the forming POM skeleton through an addition reaction with trimer 3, this step being computed as the most exergonic reaction in the network. Some other steps also show large negative ΔG values, particularly those reactions involving positive and negatively charged ions, as expected. However, other addition reactions between two negatively charged ions showed to be largely exergonic, which shows the importance of accounting for entropy effects. Fig. 6 depiction of the reaction network thus summarizes the most occurring connections between the different species independently of its protonation state, which depends on pH.
By sampling randomly a vast number of speciation models, our workflow could process the data of more than 3 × 106 speciation diagrams: featurization and dimension reduction, K-means clustering, chemical-basis selection, and outliers filtering. This resulted in 2 × 104 finally selected speciation models, which were used to generate what we called a statistically average speciation diagram, accompanied by error bars. The final average diagram is in full agreement with experiment and allowed identifying two {Mo9} species previously reported in literature.
Furthermore, our work has uncovered the speciation phase diagram, i.e., total concentration vs. pH for an heteropolyoxometalate system. Indeed, for the first time, we introduced a speciation phase diagram in the form of [Metal]/[Heteroatom] ratio vs. pH. Moreover, we provide two views of the diagram, either focussed on the P-based or on the Mo-based species.
Moreover, our findings corroborate well with experimental data, revealing the dominance of the Keggin {PMo12} species at low pH levels. Interestingly, our analysis also highlights the significant presence of lacunary {PMo11} and Strandberg {P2Mo5} anions under varying concentration ratios, shedding light on the diverse molecular compositions that arise within the system. Additionally, our comprehensive examination of many speciation models has led to the inference of a plausible reaction mechanism, emphasizing the pivotal role of trimers as key intermediate building blocks in the speciation process.
The current understanding of the self-assembly processes that lead to the formation of large molecular metal-oxo clusters is based on an empirical trial-and-error basis. The novel understanding that this new simulation methodology can provide is of particular importance for improving the rational synthesis of polyoxometalates. To conclude, this work demonstrates a way to deal with the intricate reaction network governing the reactivity of phosphomolybdates and derives new and valuable insights into the distribution of species under different chemical conditions, thereby enriching our knowledge of complex systems speciation.
| Footnote | 
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc03282a | 
| This journal is © The Royal Society of Chemistry 2024 |