 Open Access Article
Nicolas Tielker a, Michel Lim b, Patrick Kibies a, Juliana Gretz c, Björn Hein-Janke d, Christian Chodun a, Ricardo A. Mata *d, Paul Czodrowski *b and Stefan M. Kast *a
a Department of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 4a, 44227 Dortmund, Germany. E-mail: stefan.kast@tu-dortmund.de
b Department of Chemistry, Johannes Gutenberg University Mainz, Duesbergweg 10-14, 55128 Mainz, Germany. E-mail: czodpaul@uni-mainz.de
c Faculty of Chemistry and Biochemistry, Ruhr University Bochum, Universitätsstraße 150, 44801 Bochum, Germany
d Institut für Physikalische Chemie, Georg-August University of Göttingen, Tammannstraße 6, 37077 Göttingen, Germany. E-mail: ricardo.mata@chemie.uni-goettingen.de
    
First published on 2nd September 2025
The development and testing of computational chemistry methods for predicting physicochemical properties is by now a mature field of research, with methods ranging from molecular mechanics simulations and quantum calculations to empirical and machine learning models. Blind prediction challenges for these properties are regularly organized to allow researchers from academia and industry to test their methods in a fair and unbiased manner. At the same time, research data management (RDM) is still not used as extensively as it could be in the development and application of such models, especially in academia. In particular, the FAIR standards (Findable, Accessible, Interoperable, Reusable) can serve as guidelines for good RDM, but many models, the data used to train them, and the data they generate fall short of one or more of these standards. The goal of the first euroSAMPL pKa blind prediction challenge was to promote and help develop good RDM standards for computational chemistry. To achieve this, the challenge was designed not only to rank the predictive performance of the models but also to evaluate their adherence to the FAIR principles through cross-evaluation by the participants themselves. We here present the analysis of the blind predictions by their statistical metrics as well as of the cross-evaluation by a newly defined “FAIRscore”. The results suggest that multiple methods can predict the pKa to within chemical accuracy, but also that “consensus” predictions constructed from multiple, independent methods may outperform each individual prediction. Furthermore, the state of research data management in the field of computational chemistry is discussed, and suggestions for future improvements are developed.
In this context, community-organized blind challenges play a special role. Not only do they provide a much-needed curated pool of data, they also create the conditions for an unbiased test of computational protocols. Every method has its unique switches and knobs: in electronic structure theory, the choice of chemical model (theory level for structure optimizations, solvent model, etc.); in machine learning approaches, the training data, model layout, and hyperparameters. With a priori knowledge of the target quantity any method can be adjusted, giving an unfair advantage to protocols with the most flexible parametrizations. Only by depriving the predictors of their targets can one truly assess their predictive power.
To achieve such an unbiased assessment of computational model performance, various blind prediction challenges have emerged over the past decades. They share the common characteristic that the modeling community is asked to solve a certain task, the prediction of some experimentally measurable quantity, given only some molecular or system description and details about the experimental setup. Only after the previously announced challenge run time has ended are the experimental data revealed to allow for statistical evaluation of model performance and ranking of different methodologies. Repeating such challenges on a single type of quantity then provides a historical perspective on the development of computational models, which is interesting in its own right. For instance, when the historical trend of a typical statistical metric such as the root mean squared error (RMSE) between experimental data and theoretical predictions plateaus at some finite non-zero value, one can ask whether this indicates the limiting experimental uncertainty or technically insurmountable difficulties in improving models further within limited resources.
The history of blind challenges for the simulation of molecular systems is fairly clear-cut. Periodic challenges include the Cambridge Structure Prediction (CSP) blind test, which goes back to 1999 (ref. 4). In this challenge, computational predictions are assessed on a set of unpublished molecular crystals. The Critical Assessment of methods of protein Structure Prediction (CASP) provides an analogue for proteins,5 with the fourteenth edition held in 2020 (ref. 6) playing a pivotal role in the recognition of deep learning methodologies (namely the AlphaFold2 model7). Individual experimental blind challenges have also recently been pushed forward by smaller groups,8–12 but in such cases it is hard to maintain the same continuity and level of visibility as CSP or CASP. The field of drug discovery has benefitted from the now discontinued grand challenges (GC) organized by the D3R (drug design data resource)13 and its predecessor, the community structure-affinity resource (CSAR).14 A common trait across these initiatives, small or large, is the availability of experimentalists willing not only to conduct the necessary experiments but also to wait patiently for the challenge to be concluded before publishing their results. This can take between months and years, depending on how fast the data analysis can be carried out and how quickly the participants are able to provide all the material needed for publication. Such a high effort can, however, turn out to be highly valuable, as has been demonstrated by the successful completion of the CACHE (critical assessment of computational hit-finding experiments) challenge #2 (refs. 15 and 16).
Another long-running challenge series is the statistical assessment of the modeling of proteins and ligands (SAMPL).17–19 These challenges aim at a critical assessment of the predictive power of computational protocols in different facets of rational drug discovery. They cover experimentally determined quantities such as binding affinities, hydration free energies, and partition and distribution coefficients, with the targeted observables depending on the specific edition. In an effort to keep the community action alive and growing, new promoters of the challenge came together and created an extension, this time with the experiments being carried out in Europe, hence the name euroSAMPL.
The first euroSAMPL blind prediction challenge (“euroSAMPL1” in what follows) picked up an earlier target also addressed during SAMPL6–8, namely an investigation of the ability of computational methods to predict acidity constants (pKa) for drug-like small molecules. Our primary goal was to define a set of chemically diverse, yet well-characterized and controlled compounds in the sense that only a single macroscopic transition, i.e. change of charge, was experimentally observed in the pH range 2–12, and for which we expected dominance of only a single tautomer in each charge state according to our own calculations. This way, we hoped to attract participants from very diverse modeling communities, ranging from atomistic, quantum-mechanical (QM) methods up to empirical rule-based and machine learning approaches, as only the macroscopic pKa values had to be predicted without explicit reference to ensembles of coupled charge and tautomer (so-called microstate) transitions, as was required starting with the SAMPL7 challenge.20
This challenge design allowed for very diverse methods and, as a consequence, very different formats of primary raw data from which a single macroscopic quantity is derived. Therefore, blind prediction challenges also represent an ideal environment to test and foster adherence to modern standards of research data management (RDM). Making research data FAIR21 (Findable, Accessible, Interoperable, Reusable) is an increasingly important requirement for research groups, scientific journals, and funding organizations, and significant progress has been made by taking advantage of the increasing digitalization of research data. Furthermore, good scientific practice demands that research data is published in a way that makes it reproducible. The reproducibility of computational chemistry data using only the information in a given journal article and its supporting information is vital for other researchers to easily verify and use newly developed methods. For the combination of FAIR data with data reproducibility standards to make RDM even “fairer”, we choose the acronym FAIR+R. The relevance of adding “reproducibility” to the FAIR principles has also independently been recognized by others.22 This includes methods such as the automated or manual annotation of generated research data with relevant author- and domain-specific metadata, persistent storage in suitable repositories accessible to other researchers, and the transparent and – as the ultimate goal – fully automated23 analysis of raw data to generate the chemically relevant information.
In Germany, NFDI4Chem (ref. 24) is a consortium of the “Nationale Forschungsdateninfrastruktur” (NFDI, National Research Data Infrastructure) responsible for developing sustainable RDM standards and infrastructure for both experimental and theoretical fields in chemistry. One of the specific goals of NFDI4Chem is the design of use cases that allow for testing the usage of RDM tools as well as the acceptance of and adherence to RDM standards in the community.25 The euroSAMPL challenge was designed as such a use case by requiring participants not only to submit the target predictions but also to describe the methodology, including provision of underlying raw data, in a maximally transparent and reproducible format. To evaluate conformance with the FAIR+R principles, participants were asked after the challenge had finished to anonymously peer-evaluate the submissions of all other participants using a standardized questionnaire. The availability of prediction metrics comparing theory and experiment as well as the resulting “FAIRscores” allowed for ranking and discussing submissions according to model quality, RDM quality, and their combination. In this way, and with an outlook to future challenges, we intend to continually raise the bar for both model development and RDM standards in the computational chemistry field, expecting this challenge design to also stimulate progress toward generally accepted community-specific metadata formats.
The present paper describes the setup and timeline of the challenge including the rationale behind the choice of the systems, covering experimental details and results of our own preliminary calculations. Submissions and their evaluations are discussed, followed by analysis and interpretation of resulting model prediction metrics and FAIRscores. Insights and perspectives for future challenges conclude our report.
The measurements were conducted on a Sirius T3 instrument from Pion.26 The pKa determination was carried out in UV-metric mode, utilizing a 10 mM DMSO solution, with 5 μL of the sample prepared in 25 μL of phosphate buffer to maintain accurate pH control throughout the titration. The analyses were conducted under argon flow at a temperature of 25.0 °C, with the ionic strength adjusted to 0.15 M using KCl in water or water/cosolvent mixtures, respectively. Each measurement was performed in triplicate within the same vial. For each compound, two to four such sets of triplicate measurements were arithmetically averaged to determine the final pKa values.
For poorly soluble compounds (i.e. all except euroSAMPL-2, -5, -11, -13, -14, -15, -17, -19, -20, -22, and -27), methanol–water mixtures adjusted to 30, 40, and 50% (v/v) were used to increase solubility over the entire experimental pH range. The pKa values obtained in these mixtures were then extrapolated to the aqueous pKa by the Yasuda–Shedlovsky procedure as performed by the analytics software. Based on previous measurements in various buffer/cosolvent mixtures, we expect that the residual DMSO does not affect the aqueous equilibrium.27 The experimental pKa values ranged from 2.9 to 9.5.
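For orientation, the Yasuda–Shedlovsky extrapolation is commonly written as a linear relation between the apparent pKa measured in a cosolvent mixture and the inverse dielectric constant of that mixture. The expression below is the generic textbook form, not one taken from the instrument software; the symbols $p_{\mathrm{s}}K_{\mathrm{a}}$ (apparent pKa in the mixture), $\varepsilon$ (dielectric constant of the mixture), and the fit parameters $a$ and $b$ are notation assumed here for illustration.

```latex
\begin{equation}
  p_{\mathrm{s}}K_{\mathrm{a}} + \log_{10}[\mathrm{H_2O}] = \frac{a}{\varepsilon} + b
\end{equation}
```

Under this assumption, $a$ and $b$ would be fitted to the values measured in the cosolvent mixtures, and the aqueous pKa would follow from evaluating the fitted line at the dielectric constant and molar water concentration of pure water.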
The experimental data, presented as macroscopic pKa values, revealed neither which group was predominantly titrated, nor the identities of the associated macrostates (total charge), nor the contributing microstates (tautomers). Additionally, the data provided no information about the charge states of the protonated and deprotonated species corresponding to each macroscopic pKa.
The initial set of molecules was screened based on the quality of the pKa measurements conducted on the Sirius T3. Measurements with missing data points, solvation issues or poor fits were excluded from the dataset. The remaining molecule representations were standardized using RDKit (version 2021_09_2) by removing salts, neutralizing charges and generating canonical SMILES representations.28 Possible tautomers were enumerated using OpenEye's QUACPAC software (version 2.1.3).29 Subsequently, these tautomers were cross-referenced with a comprehensive compilation of literature datasets to ensure that these specific molecules had not been measured previously. The literature database integrated data from multiple sources, including public databases such as ChEMBL,30 DataWarrior,31 datasets from the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) challenges,20,32 experimental measurements from our laboratory, and data extracted from various publications.33–40 The filtered dataset was further processed by removing molecules with more than one stereocenter and those exhibiting more than one measured pKa value. Stereocenters were identified using RDKit. The final set of 35 euroSAMPL1 challenge molecules remaining after all filtering steps is depicted in Fig. 1 along with their experimental pKa values; Fig. 2 shows molecular specification distributions.
Fig. 1 euroSAMPL1 compounds and their experimentally measured pKa values. The group at which (de-)protonation predominantly occurs, as predicted by ChemAxon Marvin, is marked with a green circle.
The dominant tautomers and protonation states for our own reference calculations were generated based on pKa values and underlying microstates predicted by ChemAxon Marvin (version 21.20).41
3D geometries for the reference calculations were generated as follows: the SMILES string for each microstate of each compound was used to generate initial structures with RDKit's EmbedMultipleConfs module.44 In accordance with our usual workflow, 50 conformations were generated for all microstates, as the number of rotatable bonds was smaller than 7 for the entire set of compounds.43 These initial structures were then preoptimized with the sander utility of AMBER20, using an ALPB solvent representation with a fixed dielectric constant of 78.5 (refs. 45 and 46). Following this preoptimization, the structures were pruned by removing all structures at least 5 kcal mol⁻¹ higher in force field energy than the minimum for that microstate, and then clustered with a distance criterion of 0.5 Å, starting with the lowest energy conformation as the first cluster representative.
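The pruning and clustering step described above can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the distance measure is a plain coordinate RMSD without alignment (the text does not specify the metric), and a conformer is taken to start a new cluster when it lies farther than the cutoff from every existing representative.

```python
import math

def rmsd(a, b):
    """Coordinate RMSD between two conformers with identical atom order.
    Assumption: no superposition is performed before comparing."""
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))

def prune_and_cluster(conformers, energy_cutoff=5.0, distance_cutoff=0.5,
                      distance=rmsd):
    """conformers: list of (energy in kcal/mol, coordinates) tuples.
    Returns the cluster representatives, ordered by increasing energy."""
    if not conformers:
        return []
    ordered = sorted(conformers, key=lambda c: c[0])
    e_min = ordered[0][0]
    # Remove structures at least `energy_cutoff` above the minimum.
    kept = [c for c in ordered if c[0] - e_min < energy_cutoff]
    representatives = []
    for energy, coords in kept:
        # The lowest-energy conformer is visited first and therefore
        # always becomes the first cluster representative.
        if all(distance(coords, rep[1]) > distance_cutoff
               for rep in representatives):
            representatives.append((energy, coords))
    return representatives
```

Because the conformers are visited in order of increasing energy, every cluster is represented by its lowest-energy member.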
The remaining cluster representatives were further optimized at the B3LYP/6-311+G(d,p) level of theory47,48 with the IEFPCM solvation model for water with default settings, using Gaussian 16 Rev. C.01 with tight convergence criteria, the default pruned “ultrafine” grid, and explicitly computing the force constants after converging the first SCF iteration.49 The optimized structures were again clustered without energy cutoff and the same distance criterion, and up to five of the remaining lowest energy cluster representatives were used in subsequent calculations with the “embedded cluster reference interaction site model” (EC-RISM).50 EC-RISM is based on a combination of the three-dimensional “reference interaction site model” (3D RISM) integral equation theory with quantum-mechanical (QM) calculations of the electronic structure.
EC-RISM calculations were conducted at the MP2/6-311+G(d,p) level of theory, using Gaussian 09 Rev. E.01 for the QM part of the calculations, as consideration of electron correlation has turned out to be essential for predicting accurate pKa values by EC-RISM.42,43 The radius of the electrostatic potential for bromine atoms was set to 1.3 Å. During the 3D RISM calculations, a cubic grid of 128³ points with a spacing of 0.3 Å was employed and the solvent represented by our modified SPC/E model51,52 with the PSE2 closure,53 while the solute was represented by the GAFF 1.7 force field's Lennard-Jones parameters for the non-electrostatic solute–solvent interactions.54 Electrostatic contributions were computed directly from the wave function, as outlined in ref. 43.
The molecular and tautomer energies were calculated by Boltzmann-weighting all Gibbs energies of the same ionization state, or all Gibbs energies of the same tautomer, respectively, from which the macroscopic pKa values were derived.43
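One common way to write such a Boltzmann weighting is sketched below. The notation is assumed here for illustration ($G_i$ for the Gibbs energy of member $i$, $S$ for the set of conformations or tautomers being combined, $p_i$ for the resulting population); ref. 43 should be consulted for the exact expressions and for the relation used to convert free energy differences into pKa values.

```latex
\begin{align}
  G_S &= -RT \ln \sum_{i \in S} \exp\!\left(-\frac{G_i}{RT}\right), \\
  p_i &= \frac{\exp\left(-G_i/RT\right)}{\sum_{j \in S} \exp\left(-G_j/RT\right)} .
\end{align}
```

In this form, a set dominated by a single low-energy member has an effective free energy $G_S$ that is essentially equal to that member's $G_i$.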
At the same time, a questionnaire about the metadata and raw data was sent to the participants to allow them to evaluate every submission except their own. Most participants used this opportunity, leading to each submission being evaluated by 6 or 7 peers. The results of this metadata questionnaire were combined with the prediction results and published to GitLab for the three best-performing submissions on 2024-06-11. The cross-evaluation of the metadata fields and the submitted raw data was part of the FAIR+R strategy underlying the euroSAMPL1 blind prediction challenge. Some metadata had already been collected in earlier SAMPL challenges, but here the goal was to formalize the process of collecting author-specific metadata and extend it to gathering community suggestions for the domain-specific metadata that are necessary to describe the calculations and make them reproducible for other researchers. For this reason the author-specific metadata were mandatory and identical for all participants, and were selected in analogy to commonly used metadata standards, such as the Dublin Core Metadata Set and the DataCite Metadata Schema.57,58 The domain-specific metadata, on the other hand, depend strongly on the specific method used, and often even differ depending on the software used to implement it. In the absence of a ready-made solution, the participants were tasked with describing, e.g., the software packages and settings used for their calculations in as much detail as necessary for other researchers to reproduce the calculations.
There were no other constraints on the type or size of the computational raw data submitted by the participants, though in practice very large submissions would have required special considerations due to limitations of the transfer protocol. The raw data could span from unstructured collections of input and log files to structured and annotated tables of energies used for the pKa calculations, including the scripts used for this. And while none of the participants chose to do so, it would have been possible to submit one “ranked” submission (meaning the only submission entering the final score) together with additional “unranked” submissions (for which no FAIRscore would have been awarded, yet the submission's metrics would have been provided) utilizing different methods, or variations of the same method, for the pKa prediction. A detailed graphic depicting the challenge infrastructure design is shown in Fig. 3.
The optional metadata fields and raw data formed the basis for determining the submissions’ FAIRscores. It was decided early on to let the challenge participants evaluate each other's submissions with a questionnaire using Google Forms. In short, the participants were given four statements on the “findability”, “interoperability”, “reusability”, and “reproducibility” of the meta- and research data. The research data's and metadata's relative “accessibility” was not explicitly evaluated, as the access criteria were identical for all submissions stored in the GitLab euroSAMPL challenge repository.55 These questions, which had to be evaluated on a scale from 1 (fully agree) to 6 (fully disagree), were:
• The metadata field names are understandable and informative for the general audience (interoperability).
• The supplied metadata is sufficient to set up comparable calculations as were used to generate the predictions (reusability, reproducibility).
• You would use the metadata field names of the submission to search for the data in repositories that offer free-text search (findability).
• The submitted raw data and documentation are sufficient to comprehend and to enable reproduction of the predictions (reusability, reproducibility).
While these four questions alone certainly do not cover the full breadth of the FAIR principles, they were intended to let the participants evaluate the submissions’ FAIRness from the point of view of experts in theoretical chemistry, not in research data management. This would allow them to take a broader perspective when considering the questions, without requiring in-depth knowledge of the FAIR criteria and their specific definitions. Because all participants had to evaluate all other participants, the average, relative evaluation of the submissions was assumed to be consistent. The FAIRscore itself was then defined as a normalized value between 0, corresponding to an average evaluation of 1.00 (fully agree), and 1, corresponding to an average evaluation of 6.00 (fully disagree).
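The normalization just described amounts to a linear mapping of the average rating from the interval 1 to 6 onto the interval 0 to 1. The sketch below illustrates this; the function name `fair_score` is hypothetical, and it assumes that the average is a plain mean over all answers a submission received, which the text does not spell out.

```python
def fair_score(ratings):
    """Map questionnaire ratings (1 = fully agree ... 6 = fully disagree)
    to a score between 0 (best) and 1 (worst).

    Assumption: all answers for a submission are averaged with equal weight.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 6 for r in ratings):
        raise ValueError("ratings must lie between 1 and 6")
    mean = sum(ratings) / len(ratings)
    return (mean - 1.0) / 5.0
```

With this convention a lower FAIRscore is better, in line with a lower RMSE indicating a better prediction.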
To facilitate a combined ranking of FAIRness and prediction quality, the prediction RMSEs were also normalized to a value between 0 and 1, but due to the theoretically unbounded nature of the RMSE these values were instead mapped to the lowest RMSE of all ranked submissions, defined as an RMSEscore of 0, and the highest RMSE, defined as an RMSEscore of 1. The average of the two individual scores was then used for the final, combined ranking metric.
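The RMSE normalization and the combined metric can be illustrated with a short sketch. The function names and the example RMSE values are hypothetical and serve only to show the arithmetic; the handling of the degenerate case in which all RMSEs are equal is an assumption, as the text does not cover it.

```python
def rmse_scores(rmse_by_submission):
    """Min-max normalize RMSEs: lowest RMSE -> 0.0, highest RMSE -> 1.0."""
    lowest = min(rmse_by_submission.values())
    highest = max(rmse_by_submission.values())
    span = highest - lowest
    if span == 0:
        # Assumption: identical RMSEs give every submission the best score.
        return {name: 0.0 for name in rmse_by_submission}
    return {name: (value - lowest) / span
            for name, value in rmse_by_submission.items()}

def combined_score(fair_score_value, rmse_score_value):
    """Average of the two normalized scores; lower is better."""
    return 0.5 * (fair_score_value + rmse_score_value)

# Hypothetical RMSE values for three made-up ranked submissions
example = {"A": 0.5, "B": 1.5, "C": 2.5}
scores = rmse_scores(example)
print(scores, combined_score(0.2, scores["B"]))
```

Because the RMSEscore is defined relative to the ranked submissions only, adding or removing a submission at either extreme changes the scores of all others.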
As the goal of the euroSAMPL challenge was to compare the performance of different pKa prediction methods without the additional complications arising from having to consider multiple microstates in the same protonation state, the detection of additional tautomers would have led to the exclusion of such compounds. The calculated populations and relative free energies for the originally generated microstates are shown in Table 1.
| Compound | Microstate | Charge | Population | ΔGtaut | Compound | Microstate | Charge | Population | ΔGtaut | 
|---|---|---|---|---|---|---|---|---|---|
| a States that were not included because the resulting macroscopic EC-RISM-predicted pKa values were outside of the experimental range. | |||||||||
| euroSAMPL-1 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-19 | T0 | 0 | 100.00 | 0.00 | 
| T1 | −1 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | ||
| euroSAMPL-2 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-20 | T0 | 0 | 100.00 | 0.00 | 
| T1 | −1 | 100.00 | 0.00 | T1 | −1 | 100.00 | 0.00 | ||
| euroSAMPL-3 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-21 | T0 | 0 | 100.00 | 0.00 | 
| T1 | 1 | 100.00 | 0.00 | T1 | −1 | 100.00 | 0.00 | ||
| euroSAMPL-4 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-22 | T0 | 0 | 100.00 | 0.00 | 
| T1 | 1 | 100.00 | 0.00 | T1 | −1 | 100.00 | 0.00 | ||
| euroSAMPL-5 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-23 | T0 | 0 | 100.00 | 0.00 | 
| T1 | 1 | 100.00 | 0.00 | T1 | −1 | 100.00 | 0.00 | ||
| euroSAMPL-6 | T0 | 0 | 100.00 | 0.00 | T2a | 1 | 100.00 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | euroSAMPL-24 | T0 | 0 | 100.00 | 0.00 | |
| euroSAMPL-7 | T0 | 0 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | |
| T1 | 1 | 100.00 | 0.00 | euroSAMPL-25 | T0 | 0 | 100.00 | 0.00 | |
| euroSAMPL-8 | T0 | 0 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | euroSAMPL-26 | T0 | 0 | 0.18 | 3.74 | |
| euroSAMPL-9 | T0 | 0 | 100.00 | 0.00 | T1 | 0 | 99.82 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | T3 | 0 | 0.00 | 7.64 | ||
| euroSAMPL-10 | T0 | 0 | 100.00 | 0.00 | T5a | −1 | 100.00 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | T2 | 1 | 100.00 | 0.00 | ||
| euroSAMPL-11 | T0 | 0 | 100.00 | 0.00 | T6 | 1 | 0.00 | 37.17 | |
| T1 | −1 | 100.00 | 0.00 | euroSAMPL-27 | T0 | 0 | 100.00 | 0.00 | |
| euroSAMPL-12 | T0 | 0 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | euroSAMPL-28 | T0 | 0 | 100.00 | 0.00 | |
| euroSAMPL-13 | T0 | 0 | 100.00 | 0.00 | T1 | 1 | 99.66 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | T2 | 1 | 0.34 | 3.36 | ||
| euroSAMPL-14 | T0 | 0 | 91.49 | 0.00 | T3a | 2 | 100.00 | 0.00 | |
| T1 | 0 | 8.30 | 1.42 | euroSAMPL-29 | T0 | 0 | 100.00 | 0.00 | |
| T3 | 0 | 0.21 | 3.60 | T1 | −1 | 100.00 | 0.00 | ||
| T2(T5,T6) | −1 | 100.00 | 0.00 | euroSAMPL-30 | T0 | 0 | 100.00 | 0.00 | |
| euroSAMPL-15 | T0 | 0 | 99.98 | 0.00 | T1 | −1 | 100.00 | 0.00 | |
| T2 | 0 | 0.02 | 5.09 | euroSAMPL-31 | T0 | 0 | 100.00 | 0.00 | |
| T3 | 0 | 0.00 | 17.68 | T1 | −1 | 100.00 | 0.00 | ||
| T6 | 0 | 0.00 | 15.51 | euroSAMPL-32 | T0 | 0 | 100.00 | 0.00 | |
| T1 | −1 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | ||
| T4 | −1 | 0.00 | 14.61 | euroSAMPL-33 | T0 | 0 | 100.00 | 0.00 | |
| T5 | −1 | 0.00 | 15.64 | T1 | 1 | 100.00 | 0.00 | ||
| euroSAMPL-16 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-34 | T0 | 0 | 100.00 | 0.00 | 
| T1 | −1 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | ||
| euroSAMPL-17 | T0 | 0 | 100.00 | 0.00 | euroSAMPL-35 | T0 | 0 | 100.00 | 0.00 | 
| T1 | −1 | 100.00 | 0.00 | T1 | 1 | 100.00 | 0.00 | ||
| euroSAMPL-18 | T0 | 0 | 99.99 | 0.00 | |||||
| T1 | 0 | 0.00 | 5.93 | ||||||
| T2 | 0 | 0.00 | 6.72 | ||||||
| T4 | −1 | 100.00 | 0.00 | ||||||
Some generated microstates of the challenge compounds systematically interconverted during the QM optimization, i.e., every conformation of that microstate was optimized into a different microstate by transferring a proton. Because these tautomerizations during QM optimization always convert higher energy microstates into lower energy microstates, this systematic behavior implies that the initial microstate is not significantly populated in solution and will have no effect on the pKa prediction.
For one of the compounds, euroSAMPL-14, EC-RISM suggested a microstate only 1.42 kcal mol⁻¹ less favorable than the main, neutral microstate T0. However, because deeper investigation using the ChemAxon Chemicalize application did not confirm any presence of this additional neutral microstate, and ignoring the microstate would only shift the calculated pKa by approximately 0.04 for EC-RISM, we decided to retain the compound as part of the dataset. For the remaining microstates, the energies calculated with EC-RISM yielded populations of less than 0.5%, with an energy difference to the next stable tautomer of at least 3 kcal mol⁻¹, so that the energetic contribution to an acidity constant calculated with or without them would be completely negligible, given the experimental and methodological uncertainties known from previous pKa prediction challenges.20,32
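The size of these effects can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions rather than the authors' workflow: it assumes a temperature of 298.15 K (matching the 25 °C of the experiments), treats the relative free energies of Table 1 as plain Boltzmann factors, and assumes that a free energy change translates into pKa units through division by RT ln 10. The function names are hypothetical.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal mol^-1 K^-1
T_ASSUMED = 298.15     # K; assumption, chosen to match 25 degrees Celsius

def populations(delta_g, temperature=T_ASSUMED):
    """Boltzmann populations from relative free energies in kcal/mol."""
    rt = R_KCAL * temperature
    weights = [math.exp(-g / rt) for g in delta_g]
    total = sum(weights)
    return [w / total for w in weights]

def pka_shift_if_minor_ignored(delta_g, temperature=T_ASSUMED):
    """Magnitude, in pKa units, of the change in the effective free energy
    of a charge state when only its lowest-energy microstate is kept."""
    rt = R_KCAL * temperature
    g_min = min(delta_g)
    total = sum(math.exp(-(g - g_min) / rt) for g in delta_g)
    return math.log10(total)

# Relative energies of the neutral microstates T0, T1 and T3 of
# euroSAMPL-14 as listed in Table 1
neutral_14 = [0.00, 1.42, 3.60]
pops = populations(neutral_14)
shift = pka_shift_if_minor_ignored(neutral_14)
print([round(100 * p, 2) for p in pops], round(shift, 3))
```

Under these assumptions the populations come out at roughly 91.5%, 8.3% and 0.2%, close to the tabulated values, and the shift is close to the 0.04 pKa units quoted above.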
| ID | Submission | Method name | Method class | 
|---|---|---|---|
| (1) | 0x4cb7101f | SP1 | ML | 
| (2) | 0x4a6c0760 | r2SCAN-3c/DRACO+ML | QM + ML | 
| (3) | 0xc7960c21 | CBio3Lab_pKa | ML | 
| (4) | 0x4b7b06e5 | BIOVIA COSMO-RS | QM | 
| (5) | 0x216604d8 | QupKake | QM + ML | 
| (6) | 0x421c06f1 | H2O_DFT | QM | 
| (7) | 0x4cb00786 | RIJCOSX-B3LYP-D3BJ(SMD)/cc-pV(T+d)Z | QM | 
| (8) | 0x3f2606c6 | IEFPCM_MST | QM | 
| (9) | 0x541007e2 | uESE | QM | 
| (10) | reference_EC-RISM | precalc | QM | 
| (11) | 0xb8320bc2_seven | seven | Null | 
Automated analysis of the results (see Table 3) already showed that despite the experimental measurements revealing only a single protonation state transition for each compound, multiple participants had found and submitted additional pKa values within or near the experimental range. This was an intended feature of the challenge design to avoid situations in which participants would have been forced to choose between two different protonation state transitions, e.g. from charge −1 to 0 and charge 0 to 1. For these submissions this occasionally led to a difference between our analysis schemes, either using only the “first” submitted pKa value for each compound, or using the “best”, i.e. the one closest to the experimental value, to generate the statistics. For the euroSAMPL1 challenge the instructions had specified that the “first” submitted value should be the one that the participants considered most likely to be the one measured experimentally.
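The difference between the two analysis schemes can be made concrete with a small sketch. It assumes that cells listing several values separate them with a vertical bar and that the value written first is the one submitted first; both are assumptions about the table layout rather than statements from the challenge documentation. The function names are hypothetical, and the three example entries are copied from the column of submission (9).

```python
import math

def parse_cell(cell):
    """Turn a cell such as '−5.58|8.85' into a list of floats, keeping order.
    The typographic minus sign (U+2212) is converted to an ASCII hyphen."""
    return [float(v.strip().replace("\u2212", "-")) for v in cell.split("|")]

def select(values, experimental, scheme):
    """Pick one predicted pKa according to the 'first' or 'best' scheme."""
    if scheme == "first":
        return values[0]
    if scheme == "best":
        return min(values, key=lambda v: abs(v - experimental))
    raise ValueError(f"unknown scheme: {scheme!r}")

def rmse(rows, scheme):
    """rows: list of (experimental pKa, cell text) pairs."""
    errors = [select(parse_cell(cell), exp, scheme) - exp for exp, cell in rows]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# euroSAMPL-5, euroSAMPL-19 and euroSAMPL-6 for submission (9)
rows = [(8.99, "−5.58|8.85"), (6.76, "19.65|6.49"), (4.24, "3.58")]
print(rmse(rows, "first"), rmse(rows, "best"))
```

For entries with a single value both schemes coincide, so any difference in the statistics stems entirely from compounds with multiple submitted values.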
| Compound | exp. | Marvin | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| euroSAMPL-1 | 3.61 | 3.16 | 3.50 | 3.15 | 4.17 | 3.66 | 3.58 | 4.31 | 4.63 | 3.29 | 2.37 | 3.17 | 
| euroSAMPL-2 | 2.91 | 3.90 | 3.49 | 3.63 | 3.73 | 2.74 | 4.23 | 2.06 | 3.04 | 2.01 | 2.98 | 3.66 | 
| euroSAMPL-3 | 5.02 | 4.54 | 5.13 | 4.84 | 4.72 | 6.57 | 4.63 | 4.48 | 8.18 | 0.98 | 5.14 | 4.69 | 
| euroSAMPL-4 | 6.10 | 5.69 | 6.54 | 6.23 | 5.75 | 7.17 | 5.83 | 6.92 | 5.45 | 1.64 | 7.61 | 4.69 | 
| euroSAMPL-5 | 8.99 | 9.17 | 8.97 | 8.56 | 7.81 | 9.62 | 8.40 | 7.41 | 11.29 | 11.16 | −5.58|8.85 | 8.21 | 
| euroSAMPL-6 | 4.24 | 11.64|4.12 | 3.91 | 4.40 | 3.98 | 4.61 | 3.98 | 5.82 | 4.15 | 4.24 | 3.58 | 3.19 | 
| euroSAMPL-7 | 4.58 | 4.08 | 4.68 | 2.37 | 5.52 | 6.19 | 11.75|3.98 | 7.12 | 1.75 | −0.27 | 1.22 | 3.95 | 
| euroSAMPL-8 | 8.50 | 8.47 | 8.91 | 7.41 | 8.30 | 8.13 | 8.15 | 7.35 | 11.26 | 11.02 | 11.85 | 9.89 | 
| euroSAMPL-9 | 4.40 | 4.71 | 4.41 | 5.36 | 4.24 | 4.94 | 4.56 | 6.77 | 5.58 | 4.23 | 7.86 | 4.70 | 
| euroSAMPL-10 | 4.65 | 4.23 | 4.84 | 4.20 | 4.21 | 5.49 | 4.46 | 2.71 | 4.57 | 5.02 | 4.26 | 5.18 | 
| euroSAMPL-11 | 3.73 | 3.91 | 3.45 | 5.33 | 4.20 | 2.88 | 3.88 | 3.40 | 4.73 | 2.68 | 2.51 | 3.52|−0.33 | 
| euroSAMPL-12 | 3.67 | 3.58 | 3.59 | 3.39 | 4.15 | 3.29 | 4.16 | 3.24 | 1.84 | 2.28 | −0.53 | 5.14 | 
| euroSAMPL-13 | 8.15 | 8.87 | 8.25 | 8.93 | 7.88 | 8.04 | 8.46 | 6.46 | 10.15 | 10.65 | 5.91 | 9.53 | 
| euroSAMPL-14 | 7.41 | 10.69 | 7.96 | 6.91 | 7.53 | 7.08 | 7.78 | 2.90 | 5.15 | 7.28 | −1.23|0.75 | 6.84 | 
| euroSAMPL-15 | 5.18 | 5.57 | 6.20 | 6.74 | 6.88 | 7.53 | 4.02 | 2.26 | 8.13 | 9.72 | 2.08 | 8.43 | 
| euroSAMPL-16 | 9.46 | 9.24 | 9.40 | 8.80 | 8.90 | 9.21 | 8.98 | 7.87 | 13.34 | 12.11 | 7.13 | 10.25 | 
| euroSAMPL-17 | 3.79 | 3.97 | 3.83 | 3.14 | 3.98 | 4.10 | 4.55 | 3.92 | 3.00 | 3.08 | 2.78 | 4.78 | 
| euroSAMPL-18 | 8.96 | 9.88 | 9.46 | 8.76 | 5.47 | 8.67 | 9.31 | 9.21 | 11.15 | 9.61 | 10.73 | 9.10 | 
| euroSAMPL-19 | 6.76 | 5.75 | 7.57 | 6.25 | 5.42 | 6.84 | 5.87 | 4.98 | 5.53 | 4.16 | 19.65 / 6.49 | 6.28 |
| euroSAMPL-20 | 4.24 | 4.11 | 4.10 | 4.24 | 4.05 | 4.33 | 4.06 | 3.10 | 4.18 | 4.26 | 2.75 | 3.18 | 
| euroSAMPL-21 | 3.05 | 3.39 | 3.02 | 3.23 | 4.05 | 3.68 | 3.86 / 2.85 | 2.42 | 4.43 | 0.84 | 0.39 | 4.28 |
| euroSAMPL-22 | 8.93 | 8.18 | 9.37 | 7.94 | 8.86 | 8.82 | 8.26 | 6.46 | 11.52 | 12.06 | 7.97 | 9.67 | 
| euroSAMPL-23 | 3.16 | 3.75 / 2.73 | 3.49 | 3.32 | 4.37 | 3.35 | 3.44 / 3.39 | 2.81 | 2.46 | 2.55 | 0.87 | 3.36 / −0.04 |
| euroSAMPL-24 | 7.63 | 8.67 | 9.31 | 8.37 | 6.24 | 8.84 | 7.52 | 9.73 | 13.07 | 3.10 | 10.92 | 6.12 | 
| euroSAMPL-25 | 5.75 | 6.25 | 4.90 | 6.32 | 5.93 | 6.28 | 5.54 | 7.54 | 3.12 | 4.15 | 7.62 | 4.81 | 
| euroSAMPL-26 | 9.24 | 10.14 / 5.00 | 9.51 | 8.88 | 4.81 | 2.22 / 8.71 | 3.16 / 9.64 | 11.42 | 11.21 | 3.52 | 11.45 | 9.20 / 0.25 |
| euroSAMPL-27 | 4.47 | 4.29 | 3.91 | 3.82 | 4.43 | 5.12 | 4.26 | 7.35 | 6.74 | 0.39 | 3.19 | 3.79 | 
| euroSAMPL-28 | 5.95 | 5.73 / 3.74 | 6.10 | 5.53 | 4.78 | 5.93 | 5.82 | 6.24 | 4.38 | 3.51 | 5.31 | 4.77 / 0.42 |
| euroSAMPL-29 | 3.28 | 3.03 | 3.63 | 4.68 | 3.89 | 3.26 | 4.30 / 4.21 | 1.42 | 4.34 | 2.87 | 3.03 | 2.20 |
| euroSAMPL-30 | 7.89 | 7.97 | 9.06 | 6.41 | 6.94 | 8.74 | 7.38 | 5.85 | 8.61 | 7.13 | 20.54 / 6.24 | 8.91 |
| euroSAMPL-31 | 2.87 | 3.52 | 3.35 | 3.07 | 3.71 | 2.81 | 3.56 | 2.27 | 2.10 | 1.50 | −0.69 | 3.65 | 
| euroSAMPL-32 | 6.64 | 6.54 | 7.27 | 7.03 | 7.39 | 6.58 | 6.64 | 8.70 | 9.80 | 4.15 | −8.38 | 8.00 | 
| euroSAMPL-33 | 6.79 | 6.09 | 6.39 | 6.17 | 5.78 | 6.84 | 6.64 | 7.91 | 4.34 | 4.74 | 7.62 | 5.53 | 
| euroSAMPL-34 | 4.04 | 4.07 | 4.02 | 3.73 | 3.91 | 4.32 | 4.08 | 3.95 | 3.75 | 4.11 | 2.26 | 4.90 | 
| euroSAMPL-35 | 6.30 | 6.33 | 6.31 | 6.19 | 5.68 | 7.04 | 5.81 | 6.36 | 8.38 | 3.48 | 5.10 | 4.42 | 
It has been argued that the “best” matching approach is always the appropriate one,59 because if the computational method predicts multiple protonation state transitions within the experimental range, the researcher's choice of one over the other is, to a degree, arbitrary. A rational decision can only be made if one prediction is significantly further away from the limit of the experimental range than the other one, even when accounting for expected experimental and prediction errors. If this is not the case, choosing the transition detected in the experiment is based on luck. We will get back to this point below for the discussion of individual results.
On the other hand, allowing the submission of multiple pKa values and picking the one closest to the experimental value for analysis and comparison is only a fair method if the quality of the prediction is already known to be reasonably high. This is particularly true in the case of only a single measured pKa value, where the order of protonation states cannot be used as an additional constraint. If one considers methods that potentially exhibit high errors in their predictions, which are explicitly encouraged to participate in this kind of blind prediction challenge to identify why or for which molecules such errors occur, it is possible that the “best” predicted value stems from a different protonation state transition than the one observed experimentally. While the argument that forcing the researcher to choose one of their predictions to be the “correct” one makes the comparison based on arbitrary factors has merit, we believe that the comparison of both matching methods has advantages for the purpose of blind prediction challenges. A method that coincidentally predicts the “correct” pKa while predicting the wrong charge states around this transition is unusable for many practical applications.
Unlike during, e.g., the SAMPL6 challenge32 there is no ambiguity that would necessitate deciding on a specific matching algorithm, as each molecule only has a single experimental pKa value to which a single prediction must be matched.
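The two matching schemes discussed above can be written down in a few lines. The following is a minimal sketch rather than the organizers' analysis script; the function names `match_first` and `match_best` are hypothetical, and the example values are those listed for submission (4) and euroSAMPL-26 in the results table.

```python
from typing import Sequence


def match_first(submitted: Sequence[float]) -> float:
    """'First' matching: use the value the participant listed first."""
    return submitted[0]


def match_best(submitted: Sequence[float], experimental: float) -> float:
    """'Best' matching: use the submitted value closest to the experiment."""
    return min(submitted, key=lambda pka: abs(pka - experimental))


# Submission (4), euroSAMPL-26: two submitted values, one experimental pKa.
submitted = [2.22, 8.71]
experimental = 9.24

first = match_first(submitted)
best = match_best(submitted, experimental)
print(f"first: {first:.2f} (error {first - experimental:+.2f})")
print(f"best:  {best:.2f} (error {best - experimental:+.2f})")
```

Because every compound has exactly one experimental value, no assignment algorithm is needed: "best" matching reduces to a nearest-value lookup, and the two schemes coincide whenever only one value was submitted.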
As shown in Table 5 and Fig. 4(A), using the “first” type of matching, the three best-performing submissions are described by the authors as “SP1” (1), “r2SCAN-3c/DRACO + ML” (2), and “CBio3Lab_pKa” (3), with RMSEs of 0.53, 0.81, and 1.21 pK units, respectively. These performances are generally in line with or even slightly better than the results of earlier SAMPL pKa prediction challenges, where for instance the best-performing submissions achieved RMSEs of 0.68 and 0.72 during the SAMPL6 and SAMPL7 challenges, respectively.20 The submissions with these results consist of two ML models and one QM + ML model.
Fig. 4 Root mean square error (RMSE) of each ranked method's predictions, i.e. excluding reference results (10, RMSE 1.107) and Null hypothesis (11, RMSE 2.444), over the entire euroSAMPL dataset, with the method IDs and colorations set according to Table 2. Results utilizing the “first” matching approach are depicted in panel A and results utilizing the “best” matching approach in panel B, both sorted in “best” matching rank order.
Utilizing the “best” predicted pKa value instead reveals a slightly different picture. In particular, the submissions “BIOVIA COSMO-RS” (4) and “QupKake” (5), a QM model with linear correction and a QM + ML model, change their ranking from fourth and fifth place to third place and overall best prediction, respectively. This indicates that, while the correct transitions were calculated, they were not identified as the most likely to occur within the experimental range, leading to significant deviations for a few individual challenge compounds when using the “first” matching approach. In the case of “QupKake” (5) the change in the RMSE from 1.67 to 0.51 is caused by just two compounds, euroSAMPL-07 and euroSAMPL-26. In both cases the submission includes two pKa values, 11.75 and 3.98 for the former, and 3.16 and 9.64 for the latter, while the experimentally determined values are 4.58 and 9.24, respectively. In these cases, a different heuristic for deciding on the “first” pKa value, such as choosing the value that is farthest away from the limits of the experimental range, would have yielded the same results as the “best” matching for these two compounds. Since this was in fact the case for most other compounds in this submission, it would have been valuable to add the method of deciding on the “first” pKa value to the method description in the metadata file to enable deeper investigation.
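The effect described above can be retraced from the results table. The sketch below is an illustration under stated assumptions, not the challenge's evaluation code: the data are the experimental values and the column (5) entries transcribed from the table, the pH limits of 2 and 12 are taken from the text, and `pick_interior` is a hypothetical implementation of the "farthest from the range limits" heuristic.

```python
import math

PH_MIN, PH_MAX = 2.0, 12.0  # experimental pH range stated in the text

# (experimental pKa, values submitted by (5) in submission order),
# euroSAMPL-1 to euroSAMPL-35, transcribed from the results table.
DATA = [
    (3.61, [3.58]), (2.91, [4.23]), (5.02, [4.63]), (6.10, [5.83]),
    (8.99, [8.40]), (4.24, [3.98]), (4.58, [11.75, 3.98]), (8.50, [8.15]),
    (4.40, [4.56]), (4.65, [4.46]), (3.73, [3.88]), (3.67, [4.16]),
    (8.15, [8.46]), (7.41, [7.78]), (5.18, [4.02]), (9.46, [8.98]),
    (3.79, [4.55]), (8.96, [9.31]), (6.76, [5.87]), (4.24, [4.06]),
    (3.05, [3.86, 2.85]), (8.93, [8.26]), (3.16, [3.44, 3.39]),
    (7.63, [7.52]), (5.75, [5.54]), (9.24, [3.16, 9.64]), (4.47, [4.26]),
    (5.95, [5.82]), (3.28, [4.30, 4.21]), (7.89, [7.38]), (2.87, [3.56]),
    (6.64, [6.64]), (6.79, [6.64]), (4.04, [4.08]), (6.30, [5.81]),
]


def pick_first(values, exp):
    """'First' matching: the value listed first by the participant."""
    return values[0]


def pick_best(values, exp):
    """'Best' matching: the submitted value closest to the experiment."""
    return min(values, key=lambda v: abs(v - exp))


def pick_interior(values, exp):
    """Hypothetical heuristic: the value farthest from the pH range limits.

    It uses no experimental information; `exp` is accepted only so that all
    three pickers share one signature.
    """
    return max(values, key=lambda v: min(v - PH_MIN, PH_MAX - v))


def score(pick):
    """RMSE over all 35 compounds for a given selection rule."""
    errors = [pick(values, exp) - exp for exp, values in DATA]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


for name, pick in [("first", pick_first), ("best", pick_best),
                   ("interior", pick_interior)]:
    print(f"{name:8s} RMSE = {score(pick):.2f}")
```

With the transcribed values, the "first" and "best" results should reproduce the RMSEs of 1.67 and 0.51 quoted in the text. The interior heuristic selects 3.98 for euroSAMPL-07 and 9.64 for euroSAMPL-26 without any knowledge of the experimental value, which removes the two large deviations.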
For the submission “BIOVIA COSMO-RS” (4) the difference in the RMSEs is slightly smaller, with 1.39 for “first” and 0.73 for “best” matching, and it is caused by only a single compound, again euroSAMPL-26. In this case the first submitted pKa value is 2.22, and the second 8.71, and it is the only compound for which two pKa values were submitted. Further investigation of the compound euroSAMPL-26, which was identified by both submissions employing “BIOVIA COSMO-RS” (4) and “QupKake” (5) as having an additional protonation state within the experimental range, revealed that our reference calculations assigned this transition a pKa value of 0.25 (assuming an additional −1 → 0 transition), well outside the experimental range.
During the 18th German Conference on Cheminformatics (https://www.gdch.de/gcc2024), the author of the submission “SP1” (1) presented additional compounds for which the method predicted multiple protonation state transitions within the experimental range. In light of this post-challenge analysis (triggering such analyses is in fact one goal of organizing blind prediction challenges) we also reviewed our initial micro- and macrostate set for compounds euroSAMPL-09, euroSAMPL-12, and euroSAMPL-28. This led to the inclusion of a +1 state for euroSAMPL-09, a −2 and +1 state for euroSAMPL-12, and a +2 state for euroSAMPL-28. Corresponding EC-RISM-predicted macroscopic pKa values are 4.19 (0 → +1) for euroSAMPL-09 (orig. 4.70 for −1 → 0, exp.: 4.40), 1.83 (0 → +1) and 12.08 (−2 → −1) for euroSAMPL-12 (orig. 5.14 for −1 → 0, exp.: 3.67), and 0.42 (+1 → +2) for euroSAMPL-28 (orig. 4.77 for 0 → +1, exp.: 5.95). This means that only euroSAMPL-09 very likely exhibits one additional transition that is fully within the experimental range and was missed in our early assessment. “SP1” (1) indeed submitted two values for euroSAMPL-09, for which the “first” submission correctly matched the experimental reference better. The authors also submitted two values for euroSAMPL-28, again with the “first” value matching best. For euroSAMPL-12, only one value was submitted, in line with the EC-RISM analysis that other values lie outside the experimental pH range.
The experimental data for some of the discussed compounds, such as euroSAMPL-26, also show signs of another protonation state, though insufficient measurement points could be collected inside the experimental pH range from 2 to 12. This is most notably also the case for euroSAMPL-03 and euroSAMPL-28 in the lower pH range, but in all of these cases, the potentiometric transition is only starting within the experimental range, and there is no sign of the inflection point and leveling off after the shift in the raw data. In retrospect, it might have been less ambiguous to define the experimental range investigated in the challenge more restrictively, e.g. as “experimental values between 2.5 and 11.5”. This would have accounted for the need to see most or all of the potentiometric transition to determine an experimental value for the pKa, whereas at a pH equal to a molecule's predicted pKa, at most 50% of the species is (de-)protonated, if no other pKa value is in close proximity. On the other hand, the “experimental range” of the potentiometric measurements ranged from a pH of 2 to 12. This knowledge should have allowed for a rational decision about the single submitted pKa value by discounting values closer to the edge of the experimental range.
For euroSAMPL-09 the issue is more complex: the close proximity of the predicted pKa values makes it impossible to clearly distinguish two different protonation state transitions. The additional pKa value should have been detected during the pre-challenge analysis, and in that case the compound would have been removed from the dataset, but the critical microstate where the amine nitrogen is protonated had not been automatically identified. This suggests that in future challenges multiple, orthogonal approaches for the detection of protonation states and their underlying tautomers should be used to minimize the chances of this occurring. Due to the good performance of most methods on this compound, the relative ranking of the methods would not change upon removal of euroSAMPL-09 from the analysis, even for methods as close in RMSE as “SP1” (1) and “QupKake” (5), and even absolute changes are <1.5% of the RMSE in all cases.
Beyond the relative ranking of the participating methods, there is also the question of trends in the relative performance of different methods for different compounds. This can help identify issues with individual experimental results, prompting a reinvestigation of the experimental data, and with a given method's predictive performance on certain substance classes. The euroSAMPL1 challenge compounds can be broadly divided into containing acidic carboxyl, aromatic hydroxy, and pyrimidinone functions, and a variety of basic aliphatic and aromatic amines.
Breaking down the predictions on the individual compounds reveals that, for many molecules, there are only minor variations in the predicted pKa values. However, outliers greater than 2.5 pK units occur for five of the nine ranked submissions, namely “CBio3Lab_pKa” (3), “H2O_DFT” (6), “RIJCOSX-B3LYP-D3BJ(SMD)/cc-pV(T+d)Z” (7), “IEFPCM_MST” (8), and “uESE” (9), even when the “best” matching approach is used. This leads to significantly increased RMSEs for these submissions even in cases where the prediction performance is good for the majority of the challenge dataset. It is noticeable that the two best-performing submissions in the “first” matching evaluation, “SP1” (1) and “r2SCAN-3c/DRACO + ML” (2) (see Table 5), also have the lowest maximum absolute error (MaxAE) between prediction and experimental value, indicating that the absence of outliers is key to good statistical performance. In fact, looking at the “best” matching statistics, with one exception where the RMSE and MaxAE values are very close (0.806/0.734 and 2.21/2.35, respectively), the order of the MaxAEs is the same as the order of the RMSEs. There is no significant clustering of the outliers for the same molecule predicted by different submissions, indicating that they are caused by methodological errors, either in the prediction itself or in the selection of microstates, not problems with the experimental setup. Furthermore, for 33 of the 35 compounds there is at least one prediction among all submissions that predicts the experimentally measured pKa to within 0.2 pK units. The exceptions to this are euroSAMPL-30, where the closest prediction made by submission (5) was off by 0.51, and euroSAMPL-15, where the closest prediction made by submission (1) was off by −1.02. However, even in these cases, the average deviations between predictions of the ranked submissions and experimental pKa values are 0.38 ± 1.15 and −0.83 ± 2.48, respectively, indicating only weak agreement among the different theoretical methods.
Chemically, these two compounds have in common that they are both aromatic alcohols, but a number of other compounds such as euroSAMPL-08, euroSAMPL-14, and euroSAMPL-16 do not systematically exhibit such a large mismatch between the predictions and the experimentally measured pKa values.
As another matter of interest, even though the number of submissions is rather small, a synthetic submission using the Null hypothesis of a pKa of 7.00 for all compounds, the center of the experimental range, yields an RMSE of only 2.44, better than some of the submissions. On the other hand, using the “average” predicted pKa of all ranked submissions with their “best” matching as a prediction would have yielded an RMSE of only 0.56 pK units, only 0.05 worse than the best-performing method “QupKake” (5). This is despite the fact that the average standard deviation of the individual predictions is as large as 1.47 pK units. Even more noticeable, restricting the average to the top 5 submissions (1)–(5) would have led to an RMSE of only 0.39, handily winning the challenge. Here, the average disagreement between methods, as measured by their predictions’ standard deviations, was still 0.59 pK units. This shows that while the individual predictions taken from different methods may in some cases miss the mark, and even disagree quite significantly with each other, their average value results in a very accurate prediction.
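Both synthetic models mentioned above are simple to reproduce. The following sketch recomputes the Null-hypothesis RMSE from the experimental values in the results table and illustrates the consensus idea on the first three compounds only, using the values of submissions (1) to (5). It is an illustration, not the organizers' script; the use of the sample standard deviation as the measure of spread is an assumption, since the text does not state which estimator was used.

```python
import math
from statistics import mean, stdev

# Experimental pKa values, euroSAMPL-1 to euroSAMPL-35 (results table).
EXP = [3.61, 2.91, 5.02, 6.10, 8.99, 4.24, 4.58, 8.50, 4.40, 4.65,
       3.73, 3.67, 8.15, 7.41, 5.18, 9.46, 3.79, 8.96, 6.76, 4.24,
       3.05, 8.93, 3.16, 7.63, 5.75, 9.24, 4.47, 5.95, 3.28, 7.89,
       2.87, 6.64, 6.79, 4.04, 6.30]


def rmse(pred, ref):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(mean((p - r) ** 2 for p, r in zip(pred, ref)))


# Null hypothesis: predict pKa = 7.00 for every compound.
null_rmse = rmse([7.00] * len(EXP), EXP)
print(f"Null model RMSE: {null_rmse:.3f}")

# Consensus illustration: predictions of submissions (1)-(5) for the first
# three compounds (a single value was submitted in each of these cases).
TOP5 = {
    "euroSAMPL-1": ([3.50, 3.15, 4.17, 3.66, 3.58], 3.61),
    "euroSAMPL-2": ([3.49, 3.63, 3.73, 2.74, 4.23], 2.91),
    "euroSAMPL-3": ([5.13, 4.84, 4.72, 6.57, 4.63], 5.02),
}

consensus = {}
for name, (preds, exp) in TOP5.items():
    consensus[name] = mean(preds)
    # stdev() is the sample standard deviation (assumed measure of spread).
    print(f"{name}: consensus {consensus[name]:.2f} +/- {stdev(preds):.2f}, "
          f"exp. {exp:.2f}")
```

Reproducing the full consensus RMSEs of 0.56 and 0.39 would require applying the same averaging to the “best”-matched values of all 35 compounds, which is omitted here for brevity.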
| ID | Q1 | Q2 | Q3 | Q4 | Ø | 
|---|---|---|---|---|---|
| (7) | 1.17 | 1.67 | 1.67 | 1.67 | 1.54 | 
| (1) | 1.50 | 2.00 | 1.67 | 2.83 | 2.00 | 
| (2) | 1.43 | 1.86 | 1.86 | 2.86 | 2.00 | 
| (9) | 1.67 | 1.67 | 2.33 | 2.33 | 2.00 | 
This exemplarily FAIR submission was followed by three equally FAIR submissions in second place, namely “SP1” (1), “r2SCAN-3c/DRACO + ML” (2), and “uESE” (9) with an identical FAIRscore of 2.00 (normalized: 0.421), a gap of 0.46 to the FAIRest submission. While the aggregate scores for these submissions are identical, the underlying individual scores, and thus the reasons for their slightly worse FAIRscores, differ. While the scores for Q1 are reasonably close together, submission “uESE” (9) scores better in reproducibility and reusability (Q2 and Q4) while scoring worse in findability (Q3). As this is the only pure physics-based QM method among these three, one needs to think about the perception of the peers when it comes to assessing “reproducibility”. One might argue that the distinction between empirical and physics-based methods is related to software availability on the one hand and to data handling on the other hand. For empirical methods, the target property can potentially be obtained from the structural input directly, hence “reproducibility” hinges upon direct access to the software. In contrast, for physics-based methods several postprocessing steps connect primary physical data with the target property in question. Hence, it is important that all raw data and metadata are available, ideally in combination with programs to extract the target property from raw data. In this case, primary software need not necessarily be available and a user might be content with the published information. In conclusion, when the software for empirical models is not freely available, peers could be tempted to assign lesser “reproducibility”.
Submission “RIJCOSX-B3LYP-D3BJ(SMD)/cc-pV(T+d)Z” (7) also provided the most extensive raw data. The same output files were provided for only one other submission, but in that case no processed data such as energies, or the calculation of the acidity constants from the raw energies, were part of the raw data. This is most noticeable in the evaluation of Q4, where the largest gaps between the highest-ranked submission and the three runners-up occur, ranging from 0.66 to 1.19.
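The aggregate scores and gaps quoted above can be checked against the per-question table. The sketch below assumes that the Ø column is the unweighted mean of Q1 to Q4 and that lower scores are better; both are inferences from the tabulated numbers and the surrounding text, not statements from the challenge rules.

```python
from statistics import mean

# Per-question peer-evaluation scores (Q1..Q4) from the table above.
scores = {
    "(7)": [1.17, 1.67, 1.67, 1.67],
    "(1)": [1.50, 2.00, 1.67, 2.83],
    "(2)": [1.43, 1.86, 1.86, 2.86],
    "(9)": [1.67, 1.67, 2.33, 2.33],
}

# Assumption: the aggregate FAIRscore is the plain mean of the four questions.
fair = {sid: mean(q) for sid, q in scores.items()}

# Assumption: lower is better, so the smallest mean marks the FAIRest entry.
fairest = min(fair, key=fair.get)

# Gap in Q4 (last column) between each runner-up and the FAIRest submission.
q4_gap = {sid: q[3] - scores[fairest][3]
          for sid, q in scores.items() if sid != fairest}

for sid, value in fair.items():
    print(f"{sid}: mean score {value:.2f}")
for sid, gap in q4_gap.items():
    print(f"{sid}: Q4 gap to {fairest} = {gap:.2f}")
```

The recomputed means can differ from the tabulated Ø values in the last digit, because the table entries are themselves rounded.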
Combining the normalized results of the pKa prediction with the normalized FAIRscores, as shown in Table 5, also shows that even submissions which do not perform well in the pKa prediction part of the challenge can be FAIR+R, with the best FAIRscore assigned to one of the lower-ranked submissions, “RIJCOSX-B3LYP-D3BJ(SMD)/cc-pV(T+d)Z” (7), significantly improving its combined rating from seventh to fourth rank using “best” matching and third rank using “first” matching. On the other hand, the overall combined ranking still rewards a good predictive performance, with the first and second place remaining the same due to their good FAIRscore.
One interesting result of the analysis is the very good performance of a synthetic “consensus” model. Using the mean prediction of many orthogonal methods appears to systematically improve the prediction quality, even though the methods’ individual predictions spread significantly around the mean. Similar results have been noted before,62 however in that case only empirical models were used for what the authors call “data fusion”. The phenomenon, especially including physics-based, mixed, and pure ML/empirical models, should be tested on a larger dataset of pKa values, as in practice this could allow researchers to predict pKa values more accurately by using multiple fast methods to generate a consensus prediction. If this is found to be the case, its applicability to other physicochemical properties like partition and distribution coefficients or solubilities should be investigated as well.
The novelty of this challenge was the completely redesigned approach towards improved research data management, and in this domain some key insights were obtained. As a result of analyzing the peer evaluation results, the collection of meaningful metadata and raw data appears to be easier for physics-based methods where primary physical data (such as energies per structure) are automatically generated during the calculations. Because the final prediction, i.e. the target observable, is derived by post-processing the primary data within a well-defined mathematical framework, its provenance can be described in such a way that the calculation can be reproduced by others without access to the original software that produced the primary data. Conversely, empirical and ML tools often generate only a small amount of research data in the first place, unless models, i.e. software along with parameters, are made available. Some might only use a molecular identifier as input to generate a prediction as output; however, the models used by these methods are more complex and would benefit from increased “FAIR+Rness” in the sense of making models freely available for maximizing reproducibility. The perception of participants appears to attribute a lesser degree of reproducibility to this class of models, just because only a small amount of research data is published.
In any case, better documentation of empirical methods is possible in order to avoid the impression of using a “black box”, but this demands considerable additional effort compared to physics-based methods. In a hallmark paper, Heil et al. defined a set of reproducibility standards for machine learning methods,23 classifying them as “bronze”, “silver”, or “gold”. While the bronze standard is not too difficult to achieve from a technical perspective, requiring only that the data, models, and source code are published and downloadable, this may not always be desired by the authors of the model. A compromise could be to aim for at least fulfilling the FAIR standards for data, models, and source code, allowing for proprietary models and data to be used while improving overall research data management standards. Another option that has been investigated by the earlier SAMPL challenge maintainer and organizer David Mobley is the containerization of challenges, which would require the participants to submit a Docker container that, upon taking a pre-disclosed input, generates the prediction results.63 In theory, this would allow researchers to keep their model confidential, while fully disclosing it to the challenge organizers.
The silver and especially the gold standard are significantly more difficult to achieve, requiring the models to be set up in a way that enables implementation of the environment and reproduction of the results in a deterministic fashion with a single click, but they should still be a goal to strive for in the long run.
Similarly, among physics-based methods, while during this challenge the input and output files of the computations as well as the analysis of the raw data to produce the final predictions have been supplied by the FAIRest submission “RIJCOSX-B3LYP-D3BJ(SMD)/cc-pV(T+d)Z” (7), this should only be considered as the first step towards truly FAIR+R RDM. By utilizing the suggestions for metadata field names supplied for the different computational methods, it should be possible to help the NFDI in developing standardized ontologies for domain-specific metadata that can be extended in the future when new programs or methods become available. Then, not only can the research data be annotated with FAIR metadata, but they can also be made more reproducible by automating and containerizing their generation.
Future euroSAMPL challenges will focus not just on the prediction of simple properties such as macroscopic acidity constants. Instead, molecular properties such as microscopic acidity constants, which may be accessible with combined NMR and potentiometric measurements, will probably be a major focus. Also, problems for which ML methods are usually not specifically trained are of interest, such as temperature-dependent pKa values or non-aqueous systems. Increased collaborations with other experimental groups could also allow for a renewed focus on blind challenges for larger systems like modeling of protein–ligand or host–guest interactions, as in many of the previous original SAMPL challenges.
One additional task that must be addressed by us as challenge organizers as well as by the wider community is the lack of gender diversity as far as the challenge participants are concerned. For instance, despite the growing number of female PIs in the field, in both academic and industry roles, no female PI participated in this challenge. Investigating the reasons for this discrepancy and increasing our outreach to foster a truly representative environment is an important task for the development of future challenges.
| This journal is © the Owner Societies 2025 |