Open Access Article
Jonathan W. Zheng, Olivier Lafontant-Joseph and William H. Green*
Massachusetts Institute of Technology, Department of Chemical Engineering, USA. E-mail: whgreen@mit.edu
First published on 24th April 2026
The acid dissociation constant (pKa) quantifies the acidity of a compound, which is crucial for applications including drug design, environmental fate studies, and chemical synthesis. However, high-quality open-source digital pKa datasets are scarce, which limits the ability of researchers to search for properties of individual compounds, while also limiting the potential of data-driven predictive models. In this work, we release the IUPAC Digitized pKa Dataset, a digital version of a critically-assessed collection of data compiled up to 1970. The dataset includes metadata such as temperature, measurement method, assessed reliability of data, and chemical identifiers such as SMILES and InChI strings. The dataset spans 24 222 entries across 10 564 unique molecules, making it the largest FAIR open-source dataset publicly available for aqueous pKa data. Herein, we detail the data digitization and checking process, and assess the informational space spanned by the data. We compare the new digital dataset to other widely-used datasets. Several pKa predictors have been trained using these other datasets, but often have not been reliably tested due to overlap between the training and test data. We use the data to train a macroscopic pKa predictor and determine its accuracy using overlap-free test data. The full dataset is available at https://doi.org/10.5281/zenodo.7236452, and the models and data splits used in this study are available at https://doi.org/10.5281/zenodo.18165948.
| AH(solv.) ⇌ H+(solv.) + A−(solv.) | (1) |
| $K_\mathrm{a} = \dfrac{a_{\mathrm{H^+}}\, a_{\mathrm{A^-}}}{a_{\mathrm{AH}}}$ | (2) |
| pKa = −log10(Ka) | (3) |
Under this definition, pKa strictly refers to proton loss. The International Union of Pure and Applied Chemistry (IUPAC) prefers to call the cation H+ a “hydron” rather than a proton, as the term encompasses such isotopes as the deuteron.13 In this work, we will use the term “proton” interchangeably with “hydron” due to its wider current acceptance.
A common way of representing proton gain is by reporting the pKa of its conjugate acid, a term sometimes called the “basic pKa”. In this convention, the “basic pKa” of compound B numerically represents the acidic dissociation constant for BH+, the conjugate acid of B, or:
| BH+(solv.) ⇌ B(solv.) + H+(solv.) | (4) |
This is sometimes confusingly referred to as the “pKa of B”, with the understanding that the value refers to the conjugate acid's acidity since B itself is basic. To avoid ambiguity in this work, we use the convention that the pKa of a species strictly refers to its acidic dissociation, as in eqn (1). We use the term pKaH(B) (as recently recommended by IUPAC)14 to designate the pKa corresponding to the conjugate acid, i.e. pKa(BH+).
Often, AH contains several ionizable protic sites, and there might be several isomers of AH in equilibrium, such as zwitterions and uncharged protomers. The macroscopic pKa corresponds to the equilibrium constant for an overall charge transition, so the species A− in eqn (1) is an equilibrium mixture of several isomers with protons bonded to different atoms. In contrast, the microscopic pKa refers to the acidity of a specific microstate (corresponding to a specific isomer of AH losing a proton at a specific ionization center to form a specific isomer of A−). Whereas the former is most commonly measured, e.g. by measuring the pH at equilibrium after half of the initial AH has been deprotonated, the latter is more readily obtainable using simulations. A macroscopic pKa can be computed from the microscopic pKa values by Boltzmann-weighting the respective microstates.15–17 For monoprotic acids without tautomers, the macroscopic and microscopic pKa values are the same. For polyprotic acids with large separation between pKa values and no tautomeric effects, they are also approximately the same. In all other cases, the macroscopic and microscopic pKa can be different. Most data in this compilation are macroscopic pKa values, and do not include any information about different acidity centers.
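As a minimal sketch of how the two quantities relate, consider the simplest case in which one protonated microstate can lose a proton from several distinct sites. The macroscopic Ka is then the sum of the microscopic Ka values, which is equivalent to Boltzmann-weighting the deprotonated microstates. The function name and the numerical inputs below are illustrative assumptions and are not taken from the dataset.

```python
import math

def macro_pka_from_micro(micro_pkas):
    """Combine microscopic pKa values that share ONE protonated microstate.

    Assumption: every microscopic pKa describes the same protonated species
    losing a proton from a different site, so K_macro = sum_j K_micro,j.
    Several protonated tautomers would need an additional weighting step
    that is not shown here.
    """
    if not micro_pkas:
        raise ValueError("at least one microscopic pKa is required")
    k_macro = sum(10.0 ** (-pka) for pka in micro_pkas)
    return -math.log10(k_macro)

# Two equivalent sites: the macroscopic pKa is lower by log10(2).
print(macro_pka_from_micro([4.0, 4.0]))
# Well-separated sites: the more acidic site dominates the macroscopic value.
print(macro_pka_from_micro([4.0, 9.0]))
```

The second call illustrates why the macroscopic and microscopic values nearly coincide when the sites are well separated.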
Owing to the importance of pKa in numerous applications, considerable effort has been made to accurately predict pKa values across a range of solvents, primarily in water. Early approaches were typically group-based methods, in which the pKa values of acids are assigned based on linear free energy corrections that often depend on specific moieties. Popular examples include the Hammett equations (for benzoic acid derivatives) and Taft equations (which estimate the effect on pKa of adding groups to a parent compound).18
Deep learning has recently emerged as a tool for quick and accurate prediction of chemical properties.19–25 However, large collections of data are required to tune large numbers of parameters in these models. Recently, numerous pKa prediction models leveraging deep learning have been developed and made publicly available.11,26–33 These models are trained on open-source datasets, which include large quantities of pKa data, and sometimes also calculations (or “synthetic data”) to augment the experimental data.31–34 Despite the apparent good performance of the models, low data quantity and quality remain major obstacles. Furthermore, though many recent efforts claim to predict microscopic pKa values, they are actually trained on macroscopic pKa data. A contributor to this issue is that many datasets do not list the measurement method or type of pKa, and so this information is not immediately obvious to the modelers. Several additional data-related issues are listed below.
Because the data are computed rather than experimental, and owing to the very large size of the data, they have typically been used for pretraining (though sometimes used exclusively for training). However, the data have been shown to include some incorrect values, and are commonly misused in modeling studies due to a misinterpretation of the meaning of “acidity” and “basicity” for amphiprotic compounds.40
Though widely-used and by far the largest set of values, it originates from computations, which limits the predictive power of models to the error of the calculations. Each datapoint is provided without an estimate of uncertainty, making it difficult to know which values are accurate. Also, only pKaH1 and pKa1 are reported, which limits the ability to model polyprotic compounds.
000 pKa entries across a variety of solvents, compiled from existing academic literature. It is widely used for manual search as a reference source as well as in machine learning.42
Recent works by An et al.,43 Luo et al.,32 and Nevolianis & Zheng et al.11 have separately transformed portions of the iBonD data into tabulated forms more convenient for data science applications. The data have been used in several machine learning models.
The data entries are reported to have undergone individual curation and evaluation, and a single acidity center for each dissociation is reported. However, metadata such as temperature and assessed reliability of data are not reported. Additionally, some errors in non-aqueous data (due, for instance, to different choices of energy scales) are present.11
000 entries. Due to the lack of provenance and uncertainty estimates, the quality of the data is unclear.
000 pKa entries. The quality of the data is unclear, and distinctions are not always made between “acidic” and “basic” pKa values, so additional processing may be required; nevertheless, the data have been used for modeling.30
Another potential data source is crowdsourced data, such as the Online Chemical Modeling Environment (OCHEM), which includes provenance but has inconsistent, mixed-quality metadata, thereby requiring additional processing before usage.46
Other data are proprietary and hence not applicable for open-source research. The largest described collections of pKa data are held by companies: for instance, the S + pKa model was trained on 70 669 datapoints that combined information from Bayer Pharma, Roche, Genentech, and Bayer CropScience.47,48
Hence, there is a need for high-quality data to use for pKa prediction, as well as further clarity on how these datasets relate to one another.
222 entries spanning 10 564 unique compounds, with pKa values spanning up to six proton gains and six proton losses. The dataset includes rich metadata not included in any other compilation, such as temperature, pressure, assessed reliability, and measurement method, as well as chemical identifiers SMILES and InChI strings.
We discuss the intended use cases for this data: as a reference source, and for machine learning. We compare this dataset to the iBonD and DataWarrior collections. We show that all commonly-used pKa datasets (including this one) include data overlaps with common benchmarks, requiring data pruning for fair comparison. Finally, we analyze macroscopic pKa models trained on variants of this data and the ChEMBL database using Chemprop.49
(1) Serjeant: ionisation constants of organic acids in aqueous solution; E. P. Serjeant and Boyd Dempsey; Oxford/Pergamon (1979) (Oxford IUPAC chemical data series).50
(2) Perrin: dissociation constants of organic bases in aqueous solution; D. D. Perrin; Butterworths (1965).51
(3) Perrin Supplement: dissociation constants of organic bases in aqueous solution, Supplement 1972; D. D. Perrin; Butterworths (1972).52
IUPAC provided written permission to scan and digitize the data, provided that the output data are reviewed by IUPAC, posted in an IUPAC-owned repository, and support the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. With their permission, we scanned and digitized the reference books. A digitized copy of Perrin (1965) was obtained from the Internet Archive with IUPAC's permission.
A commercial version of these data sources is separately available under OpenEye Scientific Software.53 That collection was parsed independently of this collection and includes additional database features such as tautomer enumeration, as well as the aqueous data from Kortum54 and the non-aqueous data from Izutsu.55 The number of data in that collection is slightly different than in this work, as we were unable to resolve the structures of some compounds and thereby did not report them.
This work is a digitized adaptation developed from IUPAC source data with permission. We emphasize that this manuscript should not be considered an official IUPAC technical report (which would be published in the journal Pure and Applied Chemistry). We make no guarantees on the faithfulness of the digitization process to the print source, or for strict adherence to IUPAC conventions.
The data are publicly available at https://doi.org/10.5281/zenodo.19112621. At the time of publication, the version of the dataset is 2.3d, and the data are provided under a CC BY-NC 4.0 license.
Fig. 1 Visualization of digitization workflow. Images were converted to text via OCR, and then processed with various cheminformatics workflows.
(1) Digital scans of the reference books were obtained. With IUPAC's permission, the reference books were scanned by the authors in the MIT Libraries, or a digital copy was obtained from the Internet Archive.
(2) Optical character recognition (OCR) was employed to convert the scanned images into text, ordering the data into tables using Amazon Textract.56 Information from these tables was further parsed and processed into categories of data, e.g. pKa type, chemical name, reference, method used, and so on.
(3) The IUPAC names of the compounds were translated into SMILES and InChI strings using OPSIN,57 ChemAxon molconvert,39 PubChem,58–60 and the Chemical Identifier Resolver.61 SMILES strings were accepted only if the same string was unanimously returned by at least two methods. For entries with translations that were missing, inconsistent, or from only one translation source, we manually parsed the IUPAC names into SMILES strings following IUPAC conventions for naming. We note that tools have recently been developed to expedite and automate this process, such as MoleculeResolver, but were not used in this work.62
(4) The dataset was standardized to reduce the number of unique entries and make the data more amenable to data processing. For instance, “room temperature” was converted to 25 °C, and pressures were parsed into a separate column wherever applicable.
(5) To resolve the pKa type, we followed patterns in the reference books to confidently assign labels for a majority of the entries (for example, in the data compiled by Perrin, pKa data in descending order corresponded to pKaH data, whereas in ascending order corresponded to amphiprotic molecules). For approximately 4000 entries, we could not automatically discern the pKa type and therefore manually assigned them based on their numeric values and chemical structures.
(6) The data were analyzed and visualized, checking also for errors and outliers.
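The acceptance rule in step (3) can be sketched as follows. This is one possible reading of "unanimously returned by at least two methods" (every tool that returned a result must agree, and at least two tools must have returned one); the function name and the source labels are hypothetical, and the strings are assumed to have been canonicalized with the same toolkit beforehand.

```python
def consensus_smiles(candidates, min_votes=2):
    """Return a SMILES string agreed on by all responding sources, or None.

    `candidates` maps a source label to the SMILES it returned (None on failure).
    Assumption: strings were already canonicalized identically, so plain
    string equality is a meaningful comparison.
    """
    returned = [smiles for smiles in candidates.values() if smiles]
    # Require enough responses and complete agreement among them.
    if len(returned) < min_votes or len(set(returned)) != 1:
        return None
    return returned[0]

agree = {"opsin": "CC(=O)O", "molconvert": "CC(=O)O", "pubchem": None, "cir": "CC(=O)O"}
conflict = {"opsin": "Oc1ccccc1Cl", "molconvert": "Oc1cccc(Cl)c1"}

print(consensus_smiles(agree))     # accepted automatically
print(consensus_smiles(conflict))  # None: would be sent to manual curation
```

Entries for which the function returns `None` correspond to the cases that were translated by hand.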
The data were checked to identify suspicious entries that needed verification or correction, such as entries with abnormally high or low pKa values, missing locants in chemical names, common typos in the metadata, species with high deviation among multiple measurements, and implausible chemical structures.
ChemAxon's Protonation software predictions were checked against pKa values for all of the species. The majority of calculated values agreed with our experimental data within 1 pKa unit, though a few datapoints showed large disagreement, usually due to inconsistencies in acidity type assignment. If deviations exceeding 4 pKa units were observed, we manually reviewed the data in our collection and made corrections if necessary.
We reviewed entries with large deviations for the same acidity type at a given temperature. We also confirmed that pKaH values were lower than pKa values for amphiprotic compounds.
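Two of the checks described above (large deviation from a predicted value, and the ordering of pKaH relative to pKa for the same compound) can be sketched as below. This is an illustrative reconstruction, not the authors' script: the function name, the `predictions` mapping, and the assumption that type labels begin with "pKaH" or "pKa" are hypothetical.

```python
from collections import defaultdict

def flag_for_review(entries, predictions, max_deviation=4.0):
    """Return sorted unique_IDs whose entries deserve manual review.

    entries: dicts with keys "unique_ID", "pka_type", "pka_value".
    predictions: maps (unique_ID, pka_type) to a predicted value.
    """
    flagged = set()
    by_compound = defaultdict(lambda: {"pKaH": [], "pKa": []})
    for entry in entries:
        uid, kind, value = entry["unique_ID"], entry["pka_type"], entry["pka_value"]
        predicted = predictions.get((uid, kind))
        # Check 1: experimental and predicted values disagree strongly.
        if predicted is not None and abs(value - predicted) > max_deviation:
            flagged.add(uid)
        if kind.startswith("pKaH"):
            by_compound[uid]["pKaH"].append(value)
        elif kind.startswith("pKa"):
            by_compound[uid]["pKa"].append(value)
        # Other types (e.g. pKb) are not covered by the ordering check.
    # Check 2: for amphiprotic compounds every pKaH should lie below every pKa.
    for uid, values in by_compound.items():
        if values["pKaH"] and values["pKa"] and max(values["pKaH"]) >= min(values["pKa"]):
            flagged.add(uid)
    return sorted(flagged)

demo = [
    {"unique_ID": "ok", "pka_type": "pKaH1", "pka_value": 2.3},
    {"unique_ID": "ok", "pka_type": "pKa1", "pka_value": 9.8},
    {"unique_ID": "swapped", "pka_type": "pKaH1", "pka_value": 9.8},
    {"unique_ID": "swapped", "pka_type": "pKa1", "pka_value": 2.3},
    {"unique_ID": "outlier", "pka_type": "pKa1", "pka_value": 4.2},
]
print(flag_for_review(demo, {("outlier", "pKa1"): 9.0}))
```

The demonstration values are made up; a flagged ID only indicates that a human should inspect the entry, not that it is wrong.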
| Column header | Description |
|---|---|
| unique_ID | Identifier corresponding to each unique compound in the corresponding source; note this is unique to the entry rather than to the molecular identifier, as separate sources may contain the same compound |
| SMILES | Isomeric SMILES string canonicalized in RDKit |
| InChI | InChI string corresponding to compound; unique molecular identifiers conducive to look-up |
| pka_type | Type of acid dissociation; e.g. pKa = acid dissociation, pKaH = dissociation of conjugate acid (sometimes called “basic” pKa), pKb = base association |
| pka_value | Numerical value of pKa |
| T | Temperature of measurement (in °C), standardized to a numeric value if possible |
| remarks | Comments about this entry from the print source; e.g. ionic strength, experimental considerations |
| method | Experimental method for this entry |
| assessment | Critical assessment of source's reliability. The original authors (Perrin, Serjeant) assessed errors in pKa as: ≤0.005 for “reliable” entries, ≤0.04 for “approximate”, >0.04 for “uncertain”, and high errors for “very uncertain” |
| ref | Code corresponding to citation for original source of the data |
| ref_remarks | Additional comments that apply to all entries with the “unique_ID” with the same reference |
| entry_remarks | Additional comments that apply to all entries with this “unique_ID” |
| original_IUPAC_names | IUPAC names from the original print sources |
| name_contributors | Method(s) of obtaining SMILES strings from IUPAC names |
| num_name_contributors | Number of methods used to obtain SMILES strings from IUPAC names |
| original_IUPAC_nicknames | If applicable, secondary/common names that were also supplied in print source |
| source | Source book (Serjeant, Perrin, or Perrin Supplement; see main text) |
| pressure | If available, pressure of the measurement (units typically of atm or bar) |
| acidity_label | “A” for acidic, “AH” for conjugate acid, “B” for basic, or “other” |
| original_T | Original unprocessed temperature from the print source |
| cosolvent | Cosolvent information parsed from the entry |
We intend for this dataset to be useful for both experimental and machine learning applications. We intend that the InChI strings will allow users to look up chemicals by a unique ID. From a computational perspective, we hope that the SMILES strings will aid the development of machine learning models and quantum chemical workflows, since many software packages require SMILES strings as inputs.
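A look-up by InChI string might be sketched as follows, using only the Python standard library. The column headers follow Table 1, but the rows below are illustrative stand-ins rather than copies of released entries, and the exact label strings used in `pka_type` and `assessment` are assumptions that should be checked against the downloaded file.

```python
import csv
import io

# Illustrative rows only; they mimic the documented headers but are NOT
# copied from the released .csv file.
SAMPLE_CSV = '''unique_ID,SMILES,InChI,pka_type,pka_value,T,assessment
demo1,CC(=O)O,"InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",pKa1,4.76,25,reliable
demo1,CC(=O)O,"InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",pKa1,4.9,,uncertain
demo2,c1ccncc1,InChI=1S/C5H5N/c1-2-4-6-5-3-1/h1-5H,pKaH1,5.2,25,approximate
'''

def lookup(handle, inchi, pka_type=None):
    """Return all rows matching an InChI string, optionally filtered by pKa type."""
    rows = []
    for row in csv.DictReader(handle):
        if row["InChI"] != inchi:
            continue
        if pka_type is not None and row["pka_type"] != pka_type:
            continue
        rows.append(row)
    return rows

hits = lookup(io.StringIO(SAMPLE_CSV), "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)")
for row in hits:
    # An empty T field means no numeric temperature was available.
    print(row["pka_type"], row["pka_value"], row["T"] or "n/a", row["assessment"])
```

For the real dataset, the `io.StringIO` object would be replaced by a file handle opened with `newline=""`, as recommended for the `csv` module.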
A terminological area of note is the naming of “acidic” and “basic” pKa values for compounds that form zwitterions in solution. Historically, an ampholyte such as glycine often has its lower pKa labeled as an acidic pKa and its higher pKa as a basic one; however, such ordering is inconsistent with the microstates of the relevant protomers undergoing dissociation. Such labels have led to serious confusion in model development in recent years. For this reason, although the 3 source books used the historical “acidic” and “basic” naming precedent, we have in this work elected to use pKaH and pKa terminology, which has recently been recommended by IUPAC.14 For further discussion about potential confusion with pKa terminology, we refer interested readers to our previous work on this topic40 and to recent literature.63
222 entries spanning 10 564 unique chemical species. Most compounds include multiple pKa values with the same dissociation type but at different experimental conditions. Considering each pKa type as unique regardless of experimental condition, there are 14 681 such entries, reflecting the polyprotic and amphoteric nature of some of the 10 564 species.
The compounds are mostly small organic molecules, centered around 10 heavy atoms, with a right tail indicating some larger molecules including a few drug compounds (Fig. 2). Most of the molecules have at least one site that can accept a proton (Fig. 2c), and many compounds include at least one hydrogen bond donor site (Fig. 2d).
The distributions of pKa1 and pKaH1 shown in Fig. 3 both have peaks around 4 and 9 pKa units. The majority of pKa1 values fall in the range of 3 to 11, representing weak-to-medium acids and bases. However, there is also a significant number of datapoints for very weak bases (pKaH1 < 3) and very weak acids (pKa1 > 11).
A large variety of pKa types is present in this dataset. The data include both pKaH and pKa up to six charge states (Fig. 4). Most entries are related to first and second dissociations. Just one compound includes a sixth proton gain pKaH, whereas around ten include a sixth proton loss pKa.
pKa data are recorded across a number of conditions, including temperature, pressure, experimental method, and evaluated reliability.
Most entries reported herein were measured near or at room temperature (Fig. 5). A small number of entries are measured or calculated at elevated temperatures, allowing the temperature dependence of pKa to be visualized or estimated.
Ionic strength is sometimes also recorded. For most applications of pKa, it is assumed that the ionic strength is near zero. Most of the ionic strength measurements in this dataset are low. The ionic strength is presented in many different formats; for example, either directly as ionic strength I, as c in molar concentration, as m in concentration per kilogram of water, or as κ in specific conductance of water (in 10−6 ohm−1 cm−1). In many cases the value provided is approximate, or only a range is given. At the time of publication, the ionic strength is included in the remarks column, rather than parsed in a separate column.
The vast majority of measured data in this compilation are from electrochemical or optical methods (Fig. 6), which are only capable of discerning macroscopic pKa values. Only 30 measurements were made with NMR, which can provide information about the microstates. Hence, this dataset is intended to be used as a macroscopic pKa database, though it is in principle possible to employ physical chemical information (such as through quantum chemical simulations) to decompose the macroscopic values into the corresponding microstates.
Fig. 6 Measurement methods employed for each datapoint. The majority of measurements were obtained using electrometric experiments. Darker color represents higher density of points.
Table 2 shows the details of the original print sources. The works by Perrin were focused largely on basic (pKaH) values, whereas that of Serjeant was focused on acidic values. Notably, acidic pKa values prior to 1961 are missing – those were compiled by Kortum et al.54 for 1056 compounds, and published only in German. The Kortum work contains values for aliphatic and alicyclic carboxylic acids, aromatic carboxylic acids, phenolic acids, and other acids including phosphoric acid ethers, and sulfonic, phosphonic, and phosphinic acids. We unfortunately were unable to translate the Kortum work with high confidence, and hence it is not included herein. We hope to incorporate this data in the future.
| Reference source | # Entries | # Molecules | Description |
|---|---|---|---|
| perrin | 7769 | 3433 | Values mostly for organic bases, collected up to 196151 |
| perrin_supp | 7147 | 3914 | An add-on to the first Perrin compilation, collected from 1961 to 1970, mostly of values for organic bases. Includes some corrections to Perrin, which were incorporated in this digitization52 |
| serjeant | 9295 | 4108 | Supplement to work by Kortum et al.54 Data collected from 1961 to 1970, mostly of acidic dissociations50 |
N moiety, which forms a resonance-stabilized cation upon protonation.
• Findability – each compound has a unique identifier. Rich metadata are provided, with descriptions that explicitly describe the type of data.
• Accessibility – the data and metadata can be accessed freely (through free download) – currently through Zenodo – in a .csv file.
• Interoperability – the .csv format with chemical identifiers allows it to be readily used in cheminformatics workflows and beyond.
• Reusability – license information is provided in the README included in the download repository, and sources are given for each datapoint.
Because the data are released in a digital format, they are also readily updateable. Errors in the database can be pointed out by users and subsequently fixed in the datasheet.
Fig. 9 shows the chemical space spanned by the compounds, represented by a UMAP plot.64 The domain is fairly consistently covered among the IUPAC, DataWarrior, and iBonD datasets, with overlapping points spanning practically all regions of this UMAP plot. That said, there are regions of chemical space where more examples are present for certain datasets versus the others. In particular, the bottom-right region contains mostly IUPAC data, which correspond mostly to simple nitrogen-containing heterocyclic compounds (such as pteridine and pyrimidine derivatives). In contrast, the top-center, top-left, and bottommost regions of the UMAP plot are somewhat more heavily populated by the aqueous iBonD data, which correspond to sulfonamides and large, druglike molecules.
Fig. 10 shows that the distributions of the molecular weights covered by the IUPAC, iBonD, and DataWarrior sets are very similar, focusing on small organic compounds, though iBonD includes more large molecules. The great majority of the molecules in all 3 datasets have molecular weights of less than 300 amu, which may limit the applicability of these datasets for modeling larger, more complex biochemical or pharmaceutical compounds.
The distributions of data are shown in Fig. 11. Compared to iBonD, the IUPAC dataset includes more pKaH values and a stronger left tail (very weak bases). Like the others, it includes peaks at approximately 4 and 9 pKa units, and a weak right tail. These provide further evidence that the experimental datasets span a generally similar set of compounds.
Before further discussion of this dataset, we must first comment on some highly pervasive misuse of pKa data, which has thus far evaded discussion in the literature. After addressing these issues, we then discuss the modeling results.
Four compounds in the Novartis acid and base datasets27 are also present in this dataset. Among those four, three have exact matches for pKa type; one compound is in the dataset for a different acidity type (perrin3511) and may not be considered a leak.
All of these potential data leaks, from all datasets, are presented in the Zenodo repository associated with this manuscript. As both this dataset and others may be used in machine learning efforts, future works should be careful to remove these data from the training sets if evaluating on SAMPL or Novartis data to avoid data leakage.
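A conservative way to remove such overlaps is sketched below: any training row whose molecular identifier also appears in the test set is dropped, regardless of pKa type. This is stricter than matching on both compound and acidity type, and it is an illustrative choice rather than the exact procedure used in this work. The function name and the row layout are hypothetical.

```python
def prune_leaks(train_rows, test_rows, key="InChI"):
    """Split training rows into (kept, removed) based on overlap with the test set.

    Rows are dicts; `key` names the identifier column used for matching.
    """
    test_keys = {row[key] for row in test_rows}
    kept = [row for row in train_rows if row[key] not in test_keys]
    removed = [row for row in train_rows if row[key] in test_keys]
    return kept, removed

train = [
    {"InChI": "InChI=1S/C5H5N/c1-2-4-6-5-3-1/h1-5H", "pka_type": "pKaH1"},
    {"InChI": "InChI=1S/CH4O/c1-2/h2H,1H3", "pka_type": "pKa1"},
]
test = [{"InChI": "InChI=1S/C5H5N/c1-2-4-6-5-3-1/h1-5H", "pka_type": "pKaH1"}]

kept, removed = prune_leaks(train, test)
print(len(kept), "kept;", len(removed), "removed")
```

Reporting the removed rows alongside the model makes the pruning auditable by later users of the same benchmark.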
The SAMPL7 challenge data also include two macroscopic pKa measurements that correspond to an inexact quantity; e.g. “pKa >12.” It is not clear whether these values were pruned, or assumed to be equal to 12 when performing tests using SAMPL7. We urge researchers who use those datasets to clearly state how those values were processed.
Finally, we note that these data are for macroscopic pKa values. In order to convert these into microscopic pKa, the modeler must compute the energetics of each relevant microstate.32,33 However, this process can be computationally expensive. If the modeler is using macroscopic data to predict values pertaining to a specific acidity center, then the modeler is actually reporting a microscopic pKa, and assuming their numerical values are identical (which is true for monoprotic acids and bases, but there can be substantial differences for polyprotic species). If this is done, the modeler should clearly state this assumption.
Extracting only pKa1 and pKaH1 values with high confidence near room temperature, we trained 10 separate Chemprop models, striping the training, validation, and test data in an 80%/10%/10% split such that every datapoint in the corpus appeared in a test split (the IUPAC model in Fig. 12). We also trained a set of 10 models on the ChEMBL dataset (labeled ChEMBL). Finally, we created a set of 10 models (labeled Finetuned), wherein each model was initialized from a ChEMBL model and then finetuned on a split of the IUPAC data. That is, for each split i from 1 to 10, a model was trained on split i of ChEMBL and finetuned on split i of IUPAC. We pruned the training and pre-training data to avoid any overlaps with the test data in both the ChEMBL and IUPAC sets, and the ChEMBL data were further pruned to remove overlapping IUPAC compounds.
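One way to obtain ten 80%/10%/10% splits in which every datapoint is tested once is to cut the data into ten folds and rotate the test and validation roles. The sketch below is an assumed reading of "striping" and not the authors' splitting code; the function name is hypothetical, and in practice the indices would be shuffled first and grouped by molecule so that one compound does not straddle two folds.

```python
def striped_splits(n_items, n_splits=10):
    """Yield (train, val, test) index lists with rotating held-out folds.

    Fold i is the test set of split i and the next fold is its validation
    set, so each index lands in exactly one test set across all splits.
    """
    if n_splits < 3:
        raise ValueError("need at least 3 folds for train/val/test")
    folds = [list(range(start, n_items, n_splits)) for start in range(n_splits)]
    for i in range(n_splits):
        val_fold = (i + 1) % n_splits
        train = [
            idx
            for j, fold in enumerate(folds)
            if j not in (i, val_fold)
            for idx in fold
        ]
        yield train, folds[val_fold], folds[i]

for train, val, test in striped_splits(100):
    print(len(train), len(val), len(test))
    break  # show the sizes of the first split only
```

With ten folds, each split uses eight folds for training, one for validation, and one for testing.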
We examined three test cases: (1) the held-out test data from the IUPAC dataset, mostly representing a large variety of small molecules; (2) the SAMPL dataset, representing a few classes of druglike fragments; (3) the Novartis dataset, representing a broader diversity of druglike molecules.
In total, this yields 3 sets of models (IUPAC, ChEMBL, Finetuned), differing only in the data used to train them as well as their hyperparameters; and 3 test sets (IUPAC splits, SAMPL, Novartis). We conducted Tukey's Honestly Significant Difference (HSD) test66 as implemented in SciPy,67,68 to better assess whether statistically significant differences exist between the models.
Fig. 12 shows the performances of the models along with the corresponding Tukey's HSD tests. Each model consists of ten constituent models trained on separate splits of the data. For each composite model under examination (IUPAC, ChEMBL, or Finetuned), the ten constituent models produce the distribution of RMSEs illustrated in Fig. 12.
In all cases, the model pretrained on ChEMBL data and then finetuned on IUPAC data performed the best or statistically identically to the best models. On the small-molecule test data, the IUPAC-based models performed best, whereas the ChEMBL model did much better than the baseline IUPAC model on the SAMPL and Novartis test sets. Only the finetuned model performed well across both small molecules and druglike molecules, making it the most versatile of the three.
The errors shown here are slightly higher than those of other recently-released models. However, we emphasize that care was taken to remove molecules that appear in the test splits from the training splits.
All the models mentioned above, including our recommended best model created from the 10 finetuned models, are available for download; see the Data Availability section. We recommend leveraging the whole ensemble of the Finetuned model to make predictions; this is readily done in the Chemprop software, which can use all 10 finetuned models to obtain an average prediction and estimate ensemble uncertainty (or leverage the other suite of uncertainty tools available).
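The idea behind ensemble averaging can be illustrated without any modeling library: given one prediction per ensemble member, the mean serves as the reported value and the sample standard deviation as a simple spread estimate. This sketch does not call Chemprop and does not reproduce its uncertainty tools; the function name and the ten numbers are made up for illustration.

```python
import statistics

def ensemble_predict(member_predictions):
    """Return (mean, sample standard deviation) of per-model predictions."""
    mean = statistics.fmean(member_predictions)
    # A single prediction carries no spread information.
    spread = statistics.stdev(member_predictions) if len(member_predictions) > 1 else 0.0
    return mean, spread

# Hypothetical outputs of ten ensemble members for one molecule.
members = [4.71, 4.80, 4.65, 4.77, 4.90, 4.68, 4.74, 4.83, 4.70, 4.79]
mean, spread = ensemble_predict(members)
print(f"pKa = {mean:.2f} +/- {spread:.2f}")
```

A large spread among members is a useful warning that the molecule may lie outside the region of chemical space covered by the training data.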
We sometimes observed that different translation algorithms returned different SMILES strings, often due to inconsistencies in how locants are parsed. In these cases, as well as when zero or one algorithm(s) returned a valid result, we manually constructed SMILES strings based on the provided names, following IUPAC naming rules. Errors in SMILES strings may appear due to either failures in machine translation, inaccuracies in manual curation, or errors in the source material, though we expect such errors to be infrequent.
Another challenge is providing representative microstructures at the corresponding experimental conditions. Many of the compounds, including amino acids, are predominantly zwitterionic at room temperature and neutral pH conditions, which has implications for the pKa values as well. However, we herein identify those species by their neutral tautomers only, as these are easier to obtain and common to search. Also, since these are generally macroscopic pKa data, the structural distinctions are not crucial to using the data productively. We hence made no effort to represent any additional potentially relevant tautomers for each compound; each compound is labeled with just one SMILES string (and thereby one isomeric form). The SMILES strings provided in the dataset correspond to the tautomers provided by the SMILES translation software, or in the case of manual translation, by whichever form most literally was provided by the IUPAC name. Therefore, amino acids and all other potentially zwitterionic compounds are represented in their uncharged form.
Additionally, the “pKa types” of some amphoteric molecules were manually assigned. Several pK values were sometimes provided in the reference works without any indication of dissociation type (e.g. pKa versus pKaH). Although we only included pKa types we could assign with high confidence, it is still possible that some errors were made during this transcription phase.
The original tabulation of the source pKa data was not always done in a consistent fashion, including the designation of acidity types and orderings. For instance, an entry may have been provided with pK1 and pK2 values, but the values may correspond to pKaH1 and pKa1, or to two acidic dissociations, or to two pKaH values. The authors of this work used their best judgment and cross-referenced experimental data to other literature and predictions by ChemAxon's Protonation software, but it is still possible that this step has introduced additional errors, especially in the Serjeant compilation. Indeed, the Serjeant compilation recommends citing the Perrin sources as the “major source of pK values” for amphiprotic compounds, which may include duplicated entries in the acidic and basic compilations. During data validation, we checked for compounds with large discrepancies, and manually corrected those. However, there is the possibility that we missed subtler errors or mislabelings.
Some data may be duplicated between the Perrin and Serjeant compilations, as both compilations were conducted independently. We include all available sets of compounds here without combining entries that appear in both compilations.
It is possible that inconsistencies or errors exist in the original tabulation, or differences exist among the three text compilations that may contradict one another. We urge the reader to check the associated references for desired datapoints when relying on individual datapoints.
Finally, as mentioned earlier, the chemical coverage of the experimental data may not be conducive toward certain applications, e.g. for modeling large pharmaceutical compounds. We anticipate that incorporating computed thermodynamic information with the data, or explicitly partitioning pKa into group-based contributions, will help improve the generalizability of the predictive models.
We intend for this dataset to be a “living” resource, in the sense that any errors discovered throughout its usage can be corrected.
We hope that this work joins other collections of pKa data as a reference for applications in organic chemistry and computational modeling. We emphasize again that this digitization may include some errors, which can be corrected in future versions of the dataset. Because references are supplied, users can cross-reference a tabulated value with its source publication.
Finally, we hope that this work, in conjunction with high-throughput experimentation and advancements in data mining that sample diverse regimes of chemical space, will lead to the development of even more capable open-source tools and libraries for pKa prediction used broadly in chemistry-related applications.
The models and data splits used in this study are available at: https://doi.org/10.5281/zenodo.18165948. The Chemprop package can be used to run the models, and is freely available on GitHub at https://github.com/chemprop/chemprop/.
Supplementary information (SI): pdf including additional examples of data analysis, information about reference works, model training details, and data pruning. See DOI: https://doi.org/10.1039/d6ra02418a.
This journal is © The Royal Society of Chemistry 2026