Brandon J. Yik, Amber J. Dood, Daniel Cruz-Ramírez de Arellano, Kimberly B. Fields and Jeffrey R. Raker*
Department of Chemistry, University of South Florida, Tampa, FL 33620, USA. E-mail: jraker@usf.edu
First published on 1st July 2021
Acid–base chemistry is a key reaction motif taught in postsecondary organic chemistry courses. More specifically, concepts from the Lewis acid–base model are broadly applicable to understanding mechanistic ideas such as electron density, nucleophilicity, and electrophilicity; thus, the Lewis model is fundamental to explaining an array of reaction mechanisms taught in organic chemistry. Herein, we report the development of a generalized predictive model using machine learning techniques to assess students’ written responses for the correct use of the Lewis acid–base model across a variety (N = 26) of open-ended formative assessment items. These items follow a general framework of prompts that ask why a compound can act as (i) an acid, (ii) a base, or (iii) both an acid and a base (i.e., amphoteric), or what is happening and why in aqueous proton-transfer reactions and in reactions that can only be explained using the Lewis model. Our predictive scoring model was constructed from a large collection of responses (N = 8520) using a machine learning technique, i.e., a support vector machine, and subsequently evaluated using a variety of validation procedures, resulting in overall accuracies of 84.5–88.9%. The predictive model underwent further scrutiny with a set of responses (N = 2162) to different prompts not used in model construction, along with a new prompt type: non-aqueous proton-transfer reactions. Model validation with these data achieved 92.7% accuracy. Our results suggest that machine learning techniques can be used to construct generalized predictive models for the evaluation of acid–base reaction mechanisms and their properties. Links to open-access files are provided that allow instructors to conduct their own analyses of written, open-ended formative assessment items to evaluate correct Lewis model use.
Current assessment items and tools are limited in measuring understanding of acids and bases. Concept inventories and multiple-choice assessments exist to measure such understanding. For example, the ACID-I concept inventory is a multiple-choice assessment that evaluates student conceptions about acid strength (McClary and Bretz, 2012). Other examples of concept inventories include a test measuring high school students’ understanding of acids and bases (Cetin-Dindar and Geban, 2011), the Acid–Base Reactions Concept Inventory (ABCI) used to measure understanding of acid–base reactions from high school through postsecondary organic chemistry (Jensen, 2013), and the Measuring Concept progressions in Acid–Base chemistry (MCAB) instrument intended to address concepts covered in general chemistry (Romine et al., 2016). However, a known shortcoming of multiple-choice assessments is that students are forced to choose an answer; this may give the false illusion that students hold certain conceptions when they could have been guessing (Birenbaum and Tatsuoka, 1987). An alternative to multiple-choice assessments is the oral examination (including research-based think-aloud interview protocols). For example, through interviews, McClary and Talanquer (2011) found that students use several different models, and even mixed models, when explaining acid and base strength. Such Socratic, dialogue-rooted assessments are impractical, particularly in courses with large enrollments (e.g., Roecker, 2007; Dicks et al., 2012). Measurement of acid–base concept understanding is further complicated in that observation-based studies have shown that students are able to draw acid–base mechanisms, or other mechanisms, without understanding the concepts behind the representation (Bhattacharyya and Bodner, 2005; Ferguson and Bodner, 2008; Grove et al., 2012).
Constructed-response assessment items that require a student to explain their reasoning better measure acid–base understanding, and chemistry concepts in general. Such open-ended items are vital for instructors to gain insight into students’ understanding and important for amending instruction to improve student learning (Bell and Cowie, 2001; Fies and Marshall, 2006; MacArthur and Jones, 2008). Assessments in which students are free to respond in complete thoughts to demonstrate their conceptual understanding provide deeper insight to instructors and send a message to students that deep understanding is important (Birenbaum and Tatsuoka, 1987; Scouller, 1998; Cooper et al., 2016; Stowe and Cooper, 2017; Underwood et al., 2018). However, open-response items, as with oral examinations, are not pragmatic for instructors’ use in large-enrollment courses and are not feasible for use with just-in-time teaching (Novak et al., 1999).
Computer-assisted predictive scoring models have been built to evaluate text-based responses to open-ended items. Use of predictive scoring models reduces evaluation time, making in-class use possible (e.g., Haudek et al., 2011, 2012; Prevost et al., 2016; Dood et al., 2018, 2020a; Noyes et al., 2020). Some of these models have been designed specifically to evaluate student understanding of acid–base chemistry through written explanations; for example, Haudek et al. (2012) used a predictive model to identify levels of correct explanation of acid–base chemistry in a biology course, and Dood et al. (2018, 2019) built a predictive model to classify use of the Lewis acid–base model in responses to a proton-transfer reaction. A meta-analysis of machine learning-based science assessments has shown that these techniques are primarily employed in middle/secondary and postsecondary environments, spanning the general science domain to more specific STEM disciplines (Zhai et al., 2020). These predictive models serve a multitude of functions (e.g., assigning scores, classifying responses, identifying key concepts), use a variety of computational algorithms (e.g., regression, decision trees, Bayes, and support vector machines), and are built with an array of software (e.g., SPSS Text Analytics, Python, R, c-rater, SIDE/LightSIDE; cf., Zhai et al., 2020). Such scoring models, though, have been prompt-specific, meaning that varying the prompt may render the scoring model invalid; thus, a new predictive scoring model must be developed for each assessment item, a process that requires hundreds of responses and multiple hours of development.
Use of the Lewis model in explaining acid–base reactions is key to mastery of organic chemistry. The goal of the work we report herein is to construct a computer-assisted predictive scoring model that detects correct use of the Lewis acid–base model in response to open-ended acid–base assessment items. This work seeks to build a single generalized predictive model that has demonstrable accuracy across an array of assessment items with the potential for instructors to use the predictive scoring model to evaluate responses to assessment items beyond those reported herein. Our results provide a foundation for more complex and nuanced predictive models built by technologies based on machine learning to evaluate understanding of reaction mechanisms beyond the foundational acid–base reaction.
While the three models are interconnected, learners’ mental conceptions of acidity and basicity suggest a lack of distinctness between the models: McClary and Talanquer (2011) reported that student conceptions are dependent on a compound's surface features and that students struggle to switch between models. Studies suggest that learners struggle with acid–base models across the postsecondary curriculum and even into the graduate chemistry curriculum (Bhattacharyya, 2006; Cartrette and Mayo, 2011; McClary and Talanquer, 2011; Bretz and McClary, 2015; Stoyanovich et al., 2015; Dood et al., 2018; Schmidt-McCormack et al., 2019). This confusion originates partially in the ambiguous relationships among the acid–base models and the lack of clear distinction between them (Schmidt and Volke, 2003; Drechsler and Schmidt, 2005). Additionally, the sheer number of models may also cause confusion (Ültay and Çalik, 2016).
Students have difficulty switching between acid–base models when solving problems, especially when the Lewis model is the most appropriate model (Cartrette and Mayo, 2011; Tarhan and Acar Sesen, 2012). Model confusion results when students attempt to apply models in circumstances where they are not applicable; for example, Tarhan and Acar Sesen (2012) found that students had greatest difficulty with Lewis acid–base reactions, such as with the reaction between ammonia and trifluoroborane, in which less than half of the participants in their study could correctly identify the Lewis acid–base reaction. Students also struggle to incorporate concepts of the Lewis model within their current understanding of acids and bases; for example, Cartrette and Mayo (2011) observed that study participants struggled with using terminology such as nucleophilicity and electrophilicity when trying to apply those concepts to a proton-transfer reaction. However, Crandell et al. (2019) noted that when students are primed to consider “why a reaction can only be described using the Lewis model,” those students are more likely to describe the transfer of electrons when asked what is happening in a given reaction and why that reaction occurs.
Students struggle with defining, giving examples, and explaining the function of acids and bases (Bhattacharyya, 2006; Cartrette and Mayo, 2011; Tarhan and Acar Sesen, 2012; Schmidt-McCormack et al., 2019). Cartrette and Mayo (2011) found that second-semester organic chemistry students could correctly define and give examples of Brønsted–Lowry acids and bases; however, less than half of their sample could do the same for Lewis acids and bases. In a study by Tarhan and Acar Sesen (2012), a majority of students also could not correctly classify a substance as a Lewis base. Schmidt-McCormack et al. (2019) reported that students were able to correctly identify Lewis acids, but were unable to describe how or why the compound acts as a Lewis acid. Additionally, Bhattacharyya (2006) showed that chemistry graduate students were able to provide definitions for acids and bases but were unable to apply their mental models to different situations. This suggests that despite the centrality of Lewis acids and bases in the curriculum, many students are leaving our courses with an underdeveloped understanding of acidity and basicity.
The amphoteric property of some chemical species, i.e., the species can act as both an acid and a base, poses further complications in Lewis acid–base understanding (Schmidt and Volke, 2003; Drechsler and Schmidt, 2005; Drechsler and Van Driel, 2008). The amphoteric property of water can be explained using both the Brønsted–Lowry and Lewis models; however, that explanation fails when the Arrhenius model is invoked. Students are perhaps uncomfortable with the classification of water as an acid or a base due to a reliance on the Arrhenius model (Schmidt and Volke, 2003; Drechsler and Schmidt, 2005). Schmidt (1997) found that some students hold the conception that conjugate acid–base pairs must be charged ions; students struggled with identifying a neutral conjugate acid or base as such. While such confusion is not limited to amphoteric species, charged species versus neutral species is central to the concept of amphoterism. While the Brønsted–Lowry model can be used to describe the amphoteric property, not all amphoteric compounds are proton donors and acceptors; therefore, understanding amphoterism using the Lewis acid–base model is more generalizable.
Understanding the Lewis acid–base model is vital to explaining reactivity in organic chemistry (Shaffer, 2006; Bhattacharyya, 2013; Stoyanovich et al., 2015; Cooper et al., 2016; Dood et al., 2018). Cooper et al. (2016) found that students who utilized the Lewis model to explain an aqueous, acid–base proton-transfer reaction had a higher likelihood of producing its accepted arrow-pushing mechanism. Expanding upon that work, Dood et al. (2018) found that students who were able to explain a proton-transfer reaction using the Lewis model had higher scores on an acid–base related examination in an organic chemistry course. Crandell et al. (2019) built upon the work of Cooper et al. (2016) by adding a mechanism that can only be explained using the Lewis acid–base model, demonstrating that features of the assessment prompt influence the degree to which particular acid–base models are invoked in explanations.
There has been one instance, to date, of the development of a computer-assisted scoring model used to evaluate written responses to an acid–base assessment item in chemistry courses. Dood et al. (2018) reported a predictive scoring model that was then further optimized and generalized by Dood et al. (2019). While the model reported by Dood et al. (2018, 2019) is a key first step toward the rapid scoring needed in large-enrollment courses, that work still falls into the category of a predictive scoring model for a single assessment item. For computer-assisted evaluation of responses to open-ended assessment items, including predictive scoring models, to become widely adopted, more generalizable models are necessary.
In an effort to provide more tools to evaluate understanding of Lewis acid–base chemistry coupled with a desire to advance work in predictive scoring models for assessments in STEM, we sought to develop a generalized predictive model to evaluate correct use of the Lewis acid–base model in responses to an array of assessment items.
What level of accuracy can be achieved for a generalized predictive model developed using machine learning techniques that predicts correct use of the Lewis acid–base model for a variety of constructed response items?
In total, responses to 15 constructed-response items were collected and used in the training set for the reported machine learning model (see Appendix 1 for a complete list of constructed-response items). The items are characterized by five types: (i) aqueous proton-transfer reaction; (ii) acid–base reaction that can only be explained using the Lewis model; and why a compound can act as (iii) an acid, (iv) a base, or (v) amphoteric (see Fig. 1 for an example of each prompt type).
Fig. 1 Examples of constructed-response items. For a comprehensive list of all prompts used in this study, see Appendices 1 and 2.
Constructed-response items were administered in a survey via Qualtrics. Participants received extra credit toward their examination grade for completing the assessment. Participants completed only one survey on acidity and basicity in the term. In total, 8520 responses to 15 different constructed-response items (see Appendix 1) were collected between Fall 2017 and Spring 2020 and used in the training set of the machine learning model. An additional 2162 responses to 11 constructed-response items with new mechanisms and compounds were collected in Fall 2020 for additional validation of the machine learning model (see Appendix 2 for a complete list of constructed-response items).
Responses classified as correct use may have solely used the Lewis model or a combination of models including the Arrhenius or Brønsted–Lowry models. However, all statements had to be correct for a given response to be classified as correct use. Representative examples of correct use responses using the example constructed-response items given in Fig. 1 are provided in Table 1.
| Prompt type | Response |
|---|---|
| Aqueous proton transfer | “The lone pair on the oxygen is forming a new bond with the hydrogen from the HCl. The chlorine then takes the electron pair from the hydrogen chlorine bond and becomes a chlorine anion. This reaction occurs because HCl is a strong acid and water is a weak base. This causes the Cl to want to break off. This is because chlorine is a good leaving group.” |
| Lewis mechanism | “A Lewis acid–base reaction is occurring creating the ammonia-trifluoroborane adduct. The lone pair on the nitrogen donates its pair of electrons to the BF3 making the boron now have a negative charge and the nitrogen now have +1 formal charge. BF3 is a very good Lewis acid as its valence shell only contains six electrons and does not have a complete octet. Nitrogen is considered the Lewis base as it donates its pair of electrons. The base forms a covalent bond with the acid making the acid–base adduct.” |
| Why acid? | “There is an empty p-orbital on the aluminum atom, which can act as an acid and accept an electron pair.” |
| Why base? | “A base is considered an electron pair donor. The nitrogen atom in pyridine has one lone pair, in which it can donate this pair of electrons; therefore, this makes it a base.” |
| Why amphoteric? | “Ethanol can act as both an acid and a base because it can accept and donate lone pairs. When it acts as an acid, the hydroxyl proton accepts an electron pair from the strong base. When it acts as a base, it donates an electron pair to a strong acid.” |
In the first sample of 8520 responses, author BJY classified all responses. Then, author JRR independently classified 250 randomly selected responses (n = 50 of each prompt type; 3% overall). Authors BJY and JRR originally agreed on 83% (n = 208) of the items; after discussing disagreements, classifications were changed for 15% (n = 38) of the items with a 98% final agreement. Author BJY then reevaluated a random sample of 1500 responses (n = 300 of each prompt type; 17.6% overall) in light of the conversation. A discussion of disagreements is presented in the Limitations section.
In the additional set of 2162 responses, author BJY again classified all responses. Author JRR independently classified 150 randomly selected responses (n = 10 of each prompt type; 7% overall). Authors BJY and JRR originally agreed on 93% (n = 139) of the items; after discussing disagreements, classifications were changed for 5% (n = 8) of the items with a 98% final agreement. No reevaluation of the set of responses was conducted as author BJY did not change any classifications during the discussion of disagreements.
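Percent agreement and Cohen's kappa for a pair of raters can be computed directly from their paired classifications. The sketch below (Python is used here for illustration; the labels are made up and are not the study's data) shows the calculation for two hypothetical raters.

```python
def percent_agreement(a, b):
    """Fraction of items on which the two raters gave the same classification."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement for two binary raters (Cohen, 1960)."""
    po = percent_agreement(a, b)                  # observed agreement
    p1a, p1b = sum(a) / len(a), sum(b) / len(b)   # each rater's rate of 'correct use'
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)        # agreement expected by chance
    return (po - pe) / (1 - pe)

# Hypothetical classifications (1 = correct use of the Lewis model)
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(percent_agreement(rater_1, rater_2), round(cohens_kappa(rater_1, rater_2), 2))
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw percent agreement whenever both raters favor the same majority class.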
A summary of the human classifications for all responses by prompt type and training/validation set is given in Table 2.
| Set | Prompt type | N | Correct use (%) | Incorrect use/non-use (%) |
|---|---|---|---|---|
| Training/cross-validation set | Aqueous proton transfer | 419 | 292 (70) | 127 (30) |
| | Lewis mechanism | 419 | 352 (84) | 67 (16) |
| | Why acid? | 419 | 243 (58) | 176 (42) |
| | Why base? | 419 | 231 (55) | 188 (45) |
| | Why amphoteric? | 419 | 198 (47) | 221 (53) |
| | Overall | 2095 | 1316 (63) | 779 (37) |
| Stratified split-validation set | Aqueous proton transfer | 100 | 70 (70) | 30 (30) |
| | Lewis mechanism | 100 | 88 (88) | 12 (12) |
| | Why acid? | 100 | 51 (51) | 49 (49) |
| | Why base? | 100 | 49 (49) | 51 (51) |
| | Why amphoteric? | 100 | 48 (48) | 52 (52) |
| | Overall | 500 | 306 (61) | 194 (39) |
| Remaining split-validation set | Aqueous proton transfer | 3146 | 2289 (73) | 857 (27) |
| | Lewis mechanism | 100 | 88 (88) | 12 (12) |
| | Why acid? | 990 | 547 (55) | 443 (45) |
| | Why base? | 1005 | 526 (52) | 479 (48) |
| | Why amphoteric? | 1184 | 556 (47) | 628 (53) |
| | Overall | 6425 | 4006 (62) | 2419 (38) |
Initial data (N = 8520) were split into two sets: training and validation. The training set consisted of a random selection of 419 responses from each of the five prompt types (i.e., a total of 2095 responses). The minimum number of responses for any prompt type was 519; therefore, 419 responses per type were chosen for the training set so that the remaining 100 responses could be set aside for validation. A stratified data set was chosen for predictive model building so as to prevent model building from being heavily influenced by one prompt type. We refer to the general set of responses as the data corpus. All machine learning work was completed in RStudio version 1.2.5033 (R Core Team, 2019).
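The stratified sampling logic can be sketched as follows (Python is used here for illustration; the study's analysis was done in R, the prompt-type names are shorthand, and a uniform 519 responses per type is a simplifying assumption since only the smallest prompt type had exactly 519 responses).

```python
import random

# Hypothetical corpus: 519 placeholder responses per prompt type
prompt_types = ["aqueous", "lewis-mechanism", "why-acid", "why-base", "why-amphoteric"]
corpus = {ptype: [f"{ptype}-response-{i}" for i in range(519)] for ptype in prompt_types}

random.seed(42)
train, stratified_validation = [], []
for ptype, responses in corpus.items():
    shuffled = random.sample(responses, len(responses))      # shuffle within prompt type
    train += shuffled[:419]                  # 419 per prompt type -> 2095 training responses
    stratified_validation += shuffled[419:]  # 100 per prompt type -> 500 validation responses

print(len(train), len(stratified_validation))
```

Sampling the same number of responses from each prompt type keeps any single prompt type from dominating model construction.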
The training corpus first underwent data preprocessing. Preprocessing began with a function to convert all characters in the corpus to lowercase (Kwartler, 2017); then all non-alphanumeric characters, special characters, and punctuation were removed using the ‘tm’ and ‘qdap’ packages in R (Feinerer et al., 2008; Rinker, 2020). Stopwords, commonly used words in the English language that usually provide little meaning (e.g., articles), were then removed using the ‘tm’ package (Feinerer et al., 2008). Additionally, a curated dictionary of 2413 custom stopwords was created and used to remove words; this custom dictionary was built by authors BJY and JRR by compiling words without general meaning concerning the use of acid–base models (e.g., specific names of chemical species) or words that do not specifically describe reactions.
Misspelled words were defined in a list of patterns that were substituted by corresponding replacements. This processing is needed as standard text analysis libraries do not recognize technical and discipline-specific vocabulary (Urban-Lurain et al., 2009; Dood et al., 2020a) such as the chemistry words in our data corpus. For example, misspelled words such as “nuclaeophilic” are replaced with the correct spelling, “nucleophilic.” Many studies skip this step; however, misspellings are noted as common error sources in human–computer score disagreements (e.g., Ha et al., 2011; Moharreri et al., 2014); thus, spending time to construct a database of commonly misspelled words and their many variations, as suggested by Ha and Nehm (2016), results in higher predictive model accuracy.
A process called lemmatization was used for text/word normalization. In lemmatization, inflected words are reduced such that the root word is the canonical (dictionary) form in the English language; it is usually coupled with part-of-speech tagging, a process in which words are assigned part-of-speech tags associated with the language of the corpus (e.g., “was” becomes “be”). We chose the singular verb form as the lemma; for example, “attack”, “attacked”, and “attacking” all become “attacks”. Additionally, synonyms in a chemical context were grouped together; patterns with lower instance counts (e.g., “cleaves”, “disconnects”, “lyses”, “severs”, “splits”, “tears”) were replaced by the more common replacement (e.g., “breaks”). A total of 1625 words to be replaced were included in the dictionary to account for misspelled words and lemmatization. This process was conducted using the ‘qdap’ package in R (Rinker, 2020).
The final corpus preprocessing step was to remove leading, trailing, and excess white spaces. The ‘tm’ package was used to remove these spaces (Feinerer et al., 2008).
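The preprocessing steps above (lowercasing, punctuation and stopword removal, misspelling correction, lemmatization/synonym grouping, whitespace cleanup) can be sketched in a single function. The authors used R's ‘tm’ and ‘qdap’ packages; the Python sketch below is only an illustrative equivalent, and its tiny stopword and replacement dictionaries are hypothetical stand-ins for the curated 2413-stopword and 1625-replacement dictionaries.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "from"}          # stand-in for tm + custom stopwords
REPLACEMENTS = {
    "nuclaeophilic": "nucleophilic",                         # misspelling correction
    "attack": "attacks", "attacked": "attacks",              # lemmatization to a single verb form
    "attacking": "attacks",
    "cleaves": "breaks", "splits": "breaks",                 # chemical-synonym grouping
}

def preprocess(response: str) -> str:
    text = response.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)                 # 2. drop punctuation/special characters
    tokens = [REPLACEMENTS.get(tok, tok)                     # 4. substitute misspellings/lemmas/synonyms
              for tok in text.split()
              if tok not in STOPWORDS]                       # 3. remove stopwords
    return " ".join(tokens)                                  # 5. collapse excess whitespace

print(preprocess("The lone pair attacked; it is Nuclaeophilic!"))
```

Applying the same function to every response yields the normalized corpus used for feature extraction.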
Following preprocessing, the next step in machine learning model building is feature extraction; this involves converting the remaining words in the data corpus (i.e., the bag of words) into independent features. The processed data corpus consists of 257 unique words called unigrams (i.e., instances of single words); Lintean et al. (2012) found that unigrams performed best across a number of machine learning text analysis algorithms. We also considered and tested other n-grams, such as bigrams (i.e., pairs of consecutive words), and a combination of both unigrams and bigrams. However, the use of unigrams was found to give the best model performance metrics (see discussion of performance metrics in Results and discussion). Our 257 unigrams were parsed into a document-term matrix, i.e., the pattern of the presence or absence of the terms, with individual student written responses (i.e., documents) representing the rows and unigrams (i.e., terms) representing the columns of the matrix. The document-term matrix was weighted using term frequency, calculated as the number of times term t appears in a document (Kwartler, 2017). We also tested other feature weightings, such as term frequency-inverse document frequency, which weights a feature as the product of term frequency and inverse document frequency (i.e., the log of the total number of documents divided by the number of documents in which term t appears; Kwartler, 2017); term frequency outperformed term frequency-inverse document frequency.
Machine learning algorithms use a document-term matrix to generate a predictive model in a process called model construction. While many machine learning algorithms exist (cf., Ramasubramanian and Singh, 2019, for an overview of possible algorithms), with several recent studies using an ensemble of algorithms (cf., Kaplan et al., 2014; Moharreri et al., 2014; Noyes et al., 2020), we chose a single algorithm for optimizing predictive performance due to a lack of interpretability of the many predictive model outputs when an ensemble of methods is used (cf., Sagi and Rokach, 2018). In other words, it is prohibitively difficult to determine contributing error and limitations, such as false positives and negatives, of each algorithm within ensemble-based models.
In this study, we use a support vector machine (SVM) algorithm with a linear basis function kernel (Cortes and Vapnik, 1995) for our classification. SVM is reported as robust compared to other algorithms for text analysis classifications (Ha et al., 2011; Nehm et al., 2012; Kim et al., 2017) and, in a meta-analysis, has been shown to have substantial machine–human score agreement (Zhai et al., 2021). Other algorithms were also tested: regularization (ridge regression, least absolute shrinkage and selection operator (LASSO), and elastic net), Bayesian (naïve Bayes), ensemble (random forest), and other instance-based methods (SVM with radial and polynomial basis kernel functions). A baseline model (naïve Bayes) was compared with the performance of the linear SVM classifier. For a twice-repeated, two-fold cross-validation (described below), the naïve Bayes classifier performed poorly (accuracy = 56.18%, Cohen's kappa = 0.22) compared to linear SVM (accuracy = 88.93%, Cohen's kappa = 0.76). Linear SVM performed best of all the algorithms tested and was therefore used for model training.
A support vector machine with a linear basis function kernel was used for model training in our analyses. In SVM, data are first mapped in multidimensional space (Cortes and Vapnik, 1995). In this process, for linear SVM, the C or cost penalty parameter is optimized; this hyperparameter regulates how much the algorithm should avoid misclassifying the training data when looking for the optimal hyperplane to classify the data (Cortes and Vapnik, 1995; Joachims, 2002; Gaspar et al., 2012). Linear SVM then calculates the optimal hyperplane by maximizing the margin (i.e., the greatest distance) between the support vectors (i.e., the data points nearest to the hyperplane) of the two classes of data (i.e., use and non-use of the Lewis model); in other words, SVM tries to find the hyperplane that best discriminates the data (Cortes and Vapnik, 1995). For our reported predictive model, hyperparameter C = 0.0055. The ‘caret’ package in R was used for model training (Kuhn, 2008).
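Training a linear-kernel SVM on term-frequency features can be sketched with scikit-learn (the authors trained in R via ‘caret’; the four labeled responses below are hypothetical, and the small cost value mirrors the reported C = 0.0055).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical preprocessed responses and human labels (1 = correct Lewis-model use)
docs = [
    "nitrogen donates lone pair electrons",
    "boron accepts electron pair",
    "acid dissolves water",
    "produces hydrogen ions water",
]
labels = [1, 1, 0, 0]

# Term-frequency features feeding a linear-kernel SVM with the reported cost parameter
model = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=0.0055))
model.fit(docs, labels)
print(model.predict(["oxygen donates electron pair"]))
```

In practice C is tuned over a grid during cross-validation rather than fixed in advance; the small optimal value indicates a soft margin that tolerates some misclassified training responses.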
To validate the predictive model, three validation methods were performed: (i) cross-validation, (ii) split-validation (also known as holdout validation), and (iii) external validation. First, cross-validation: this process involves breaking the data into k groups that are repeatedly shuffled, with model construction performed on k − 1 group(s) and the last group used for cross-validation. We used a 2-fold cross-validation that was repeated twice, i.e., the data were split into two equal halves, the model was trained on one half and then tested on the other, and this process was repeated once more. While 5- to 10-fold cross-validation is considered standard (Rodríguez et al., 2010), a 2-fold cross-validation with a smaller number of repeats is considered acceptable when the k-fold division would otherwise produce small samples (Wong and Yeh, 2020). Additionally, we did not find better performance metrics with an increase in the number of folds or repeats, and a greater number of folds and repeats increases computation costs.
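The twice-repeated, 2-fold scheme can be sketched with scikit-learn's repeated stratified k-fold utilities (again a Python stand-in for the R workflow; the synthetic, well-separated feature matrix below stands in for the real document-term matrix).

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy feature matrix standing in for the document-term matrix (rows = responses)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)

# Twice-repeated, 2-fold cross-validation: split the data in half, train on one
# half and test on the other, then repeat the whole split with a fresh shuffle
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=1)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="accuracy")
print(scores)  # four accuracy estimates (2 folds x 2 repeats)
```

Stratification keeps the class balance of correct-use and non-use responses roughly equal in each fold, which matters for the unbalanced data described below.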
Second, split-validation was performed using a stratified set and a remaining set, both drawn from the responses that were not used in the training set. Note that the split-validation data are different from the cross-validation data, as the cross-validation data were constructed from the training data. The stratified set was assembled from a random selection of 100 responses from each of the five prompt types (i.e., a total of 500 responses). The remaining set was assembled from all of the data not used for model training and construction (N = 6425 total responses) and is inclusive of the stratified set. We chose 100 responses from each prompt type to mimic a typical large-enrollment organic chemistry course.
Finally, external validation was performed; the machine learning model was evaluated using a data set of responses collected after the model was constructed. This set consisted of an additional 11 items that included reactions and compounds that were not used in the 15 items used in the construction of the predictive model; the external validation set consisted of 2162 responses. The 11 prompts for external validation are reported in Appendix 2. The external validation corpus first underwent the same data preprocessing as the training data corpus. In feature extraction, the features in the validation corpus were matched to the 257 features identified in the training corpus using a match matrix function as described by Kwartler (2017).
Three metrics used to evaluate the predictive model are Cohen's kappa, percent accuracy, and the F1 score. Cohen's kappa is a statistic for interpreting interrater reliability that accounts for the probability that raters agree due to chance (Cohen, 1960). However, we can assume that both human and computer classifications are purposeful and informed; therefore, percent accuracy is a reliable measure (McHugh, 2012). Accuracy (eqn (1)) is calculated as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative classifications, respectively.
Cohen's kappa and percent accuracy are typical measures reported for models developed to evaluate written responses to assessment items. However, these metrics are most accurate with balanced data sets (i.e., with the same number of each classification; Kwartler, 2017), for example, if there were near-equal numbers of students who correctly used and did not correctly use the Lewis acid–base model in their responses to the assessment items. Our data are heavily unbalanced for most of the individual prompt types and are skewed toward correct use of the Lewis model for the overall data set (Table 2). Due to this imbalance, a more accurate model performance metric is needed: the F1 score (Kwartler, 2017). The F1 score (eqn (2)) is a classification model performance metric that attempts to balance precision (eqn (3)) and recall (eqn (4)).
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (2)$$
$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (4)$$
While F1 is commonly used as a performance metric, one limitation is that F1 is independent of TN (see eqn (2)). In unbalanced cases, such as ours, F1 can be misleading as it does not consider the proportion of each class (i.e., TP, FP, TN, and FN) in the confusion matrix (Chicco and Jurman, 2020). An alternative metric is the Matthews correlation coefficient (MCC; eqn (5)), which is advantageous as its calculation is unaffected by unbalanced data sets (Matthews, 1975; Baldi et al., 2000; Chicco and Jurman, 2020).
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (5)$$
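All of these metrics can be computed directly from confusion-matrix counts, which makes the contrast between F1 (which ignores TN) and MCC (which uses all four cells) easy to see. The sketch below uses hypothetical counts, not values from the study.

```python
import math

def metrics(tp, fn, tn, fp):
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, f1, mcc

# Hypothetical unbalanced counts: F1 stays high because it never sees tn,
# while accuracy and MCC are pulled down by the weaker negative class
acc, prec, rec, f1, mcc = metrics(tp=90, fn=10, tn=40, fp=20)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

With these counts F1 (≈0.86) is noticeably higher than MCC (≈0.59), illustrating why MCC is the more conservative summary for unbalanced data.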
Overall accuracy of our twice-repeated, 2-fold cross-validation (Table 3) is 88.93%, with individual prompt-type accuracies ranging from 83.05% to 93.08%. The overall F1 score is 0.91, with individual prompt F1 scores ranging from 0.86 to 0.96; the overall MCC value is 0.75, with individual prompt MCC values ranging from 0.69 to 0.81. These three metrics are in good agreement, with higher accuracies having larger F1 scores and MCC values. Varying prompt accuracies, F1 scores, and MCC values indicate that the predictive model performs better for certain prompt types (e.g., a reaction mechanism that can only be explained with the Lewis model) than for others (e.g., why a compound is a base). There are relatively low false-negative rates (i.e., responses computer-scored as incorrect use/non-use that a human classifier would have scored as correct use), with moderate false-positive rates.
Table 3 Model performance metrics for the twice-repeated, 2-fold cross-validation set

Prompt type | N | κ | Accuracy (%) | F1 | MCC | TP (%) | FN (%) | TN (%) | FP (%)
---|---|---|---|---|---|---|---|---|---
Aqueous proton transfer | 419 | 0.74 | 89.26 | 0.92 | 0.73 | 93.84 | 6.16 | 78.73 | 21.27 |
Lewis mechanism | 419 | 0.72 | 93.08 | 0.96 | 0.69 | 97.73 | 2.27 | 68.66 | 31.34 |
Why acid? | 419 | 0.81 | 90.93 | 0.92 | 0.81 | 93.00 | 7.00 | 88.07 | 11.93 |
Why base? | 419 | 0.65 | 83.05 | 0.86 | 0.65 | 92.64 | 7.36 | 71.28 | 28.72 |
Why amphoteric? | 419 | 0.77 | 88.31 | 0.92 | 0.77 | 91.92 | 8.08 | 85.07 | 14.93 |
Overall | 2095 | 0.76 | 88.93 | 0.91 | 0.75 | 94.22 | 5.78 | 79.97 | 20.03 |
Notably, the accuracy for positive instances (TP vs. FN) is much greater than the accuracy for negative instances (TN vs. FP). This is likely due to the imbalance in the training data set between correct Lewis use instances (63%) and incorrect Lewis use/non-use instances (37%; see Table 2). Because the model is trained on more positive than negative instances, there is a discrepancy between its accuracy in predicting positive instances and its accuracy in predicting negative instances. This discrepancy is even more pronounced for the Lewis mechanism prompt type, where the model is 97.73% accurate for positive instances but only 68.66% accurate for negative instances (see Table 3).
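The hazard of reporting only accuracy or F1 on such a split is easy to reproduce. In the hypothetical sketch below (our illustration, not the study's data pipeline), a degenerate classifier that labels every response as correct use on a 63%/37% split still earns a seemingly respectable accuracy and F1, while MCC collapses to zero:

```python
import math

# Hypothetical 63/37 split mirroring the training data's class balance (Table 2).
n_pos, n_neg = 63, 37

# A degenerate "always predict correct use" classifier:
tp, fn = n_pos, 0   # every true positive is caught...
fp, tn = n_neg, 0   # ...but every negative instance is misclassified.

accuracy = (tp + tn) / (n_pos + n_neg)               # 0.63 from class balance alone
f1 = 2 * tp / (2 * tp + fp + fn)                     # ~0.77 despite zero skill
denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0: no skill detected
print(accuracy, round(f1, 2), mcc)
```

This is the scenario Chicco and Jurman (2020) warn about: only a metric that weighs all four confusion-matrix cells exposes the lack of discrimination on the minority class.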
Overall, this 2-fold cross-validation demonstrates that performance metrics vary by prompt type; nevertheless, the predictive model performs well across all prompt types when considered as a whole, providing evidence that the predictive model generalizes across these different prompt types.
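The twice-repeated, 2-fold splitting scheme itself is straightforward to express. Below is a minimal Python sketch of the splitting logic only (the function name and toy data are ours; the study's analyses were conducted in R):

```python
import random

def repeated_two_fold(items, repeats=2, seed=0):
    """Yield (train, test) index splits for repeated 2-fold cross-validation:
    each repeat shuffles the indices, halves them, and uses each half once
    as the held-out test fold."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    for _ in range(repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        fold_a, fold_b = indices[:half], indices[half:]
        yield fold_b, fold_a  # train on B, test on A
        yield fold_a, fold_b  # train on A, test on B

responses = [f"response_{i}" for i in range(10)]
splits = list(repeated_two_fold(responses))
print(len(splits))  # 4 train/test splits: 2 folds x 2 repeats
```

Each response is scored exactly twice as a held-out test item, which is what allows per-prompt metrics to be aggregated over the whole corpus.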
A prior predictive model using lexical analysis and binomial logistic regression to predict Lewis acid–base model use, including correct and incorrect use/non-use, had an accuracy of 82% (Dood et al., 2018). That model was further improved with new data to 86% accuracy (Dood et al., 2019); however, it is only applicable to aqueous proton-transfer reactions. Our results are, in general, more accurate than this prior work of Dood et al. (2018, 2019). When compared with other computer-assisted predictive scoring models, the predictive model we report is as accurate as or more accurate than predictive models developed for single assessment items. Thus, these initial findings and comparisons suggest that our generalized predictive model meets currently reported accuracy standards (Zhai et al., 2020).
The stratified split-validation set (Table 4) aims to mimic the class size of a large-enrollment organic chemistry course. This validation set has accuracies, F1 scores, and MCC values comparable to the cross-validation set. The prompt asking why a compound is a base is the worst performing prompt type, and the mechanistic prompt that can only be explained with the Lewis model is the best performing when considering accuracy and F1. However, in this stratified split-validation set, and also in the remaining split-validation set, there are only 12 negative instances for the Lewis mechanism prompt type versus 88 positive instances; thus, the accuracy for negative instances is lower than in the cross-validation set, which is reflected in the lower MCC value. We posit that if the Lewis mechanism prompt type had a larger sample size with more negative instances, this discrepancy would be smaller.
Table 4 Model performance metrics for the stratified split-validation set (n = 100 per prompt type)

Prompt type | N | κ | Accuracy (%) | F1 | MCC | TP (%) | FN (%) | TN (%) | FP (%)
---|---|---|---|---|---|---|---|---|---
Aqueous proton transfer | 100 | 0.73 | 89.0 | 0.92 | 0.71 | 95.71 | 4.29 | 73.33 | 26.67 |
Lewis mechanism | 100 | 0.45 | 90.0 | 0.94 | 0.46 | 96.59 | 3.41 | 41.67 | 58.33 |
Why acid? | 100 | 0.70 | 85.0 | 0.86 | 0.70 | 88.24 | 11.76 | 81.63 | 18.37 |
Why base? | 100 | 0.50 | 75.0 | 0.78 | 0.53 | 89.80 | 10.20 | 60.78 | 39.22 |
Why amphoteric? | 100 | 0.80 | 90.0 | 0.90 | 0.79 | 88.24 | 11.76 | 90.38 | 9.62 |
Overall | 500 | 0.69 | 85.8 | 0.89 | 0.69 | 92.81 | 7.19 | 74.74 | 25.26 |
With these smaller sample sizes, prediction errors increase, which explains the larger percentages of false positives and false negatives as well as the slightly lower F1 scores and MCC values; these metrics are nonetheless comparable to the cross-validation set. For the stratified split-validation set, accuracies are greater than 75% and F1 scores are greater than 0.78, demonstrating that sufficient accuracy can be obtained with sample sizes of 100. While the F1 scores indicate that the predictive model performs well in classifying positive cases (i.e., responses a human classified as correct Lewis use), the lower MCC values for the mechanism prompts that can only be explained by the Lewis model, and for the prompts asking why a compound is a base, indicate that the predictive model has difficulty classifying negative cases (i.e., responses a human classified as incorrect Lewis use/non-use).
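Stratified sampling of this kind can be sketched in a few lines of Python (a generic illustration with hypothetical labels, not the study's R code): each class is drawn in proportion to its share of the full corpus, so a 100-response sample preserves the overall class balance.

```python
import random
from collections import defaultdict

def stratified_sample(labeled, n, seed=0):
    """Draw n items from `labeled` (a list of (item, label) pairs) while
    preserving each label's proportion in the full pool."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in labeled:
        by_label[label].append((item, label))
    sample = []
    for label, group in by_label.items():
        k = round(n * len(group) / len(labeled))  # proportional allocation
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical pool mirroring a 63%/37% class balance.
pool = [(i, "correct") for i in range(630)] + [(i, "incorrect") for i in range(370)]
sample = stratified_sample(pool, n=100)
counts = {lab: sum(1 for _, l in sample if l == lab) for lab in ("correct", "incorrect")}
print(counts)  # {'correct': 63, 'incorrect': 37}
```

The proportional allocation is what makes a 100-response validation set a fair miniature of the full corpus, rather than an accidental oversample of the majority class.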
The penultimate validation test explored model performance for the remaining split-validation set (Table 5), which consists of all data not used in the training (and cross-validation) set. This validation set allows appraisal of the predictive model with a large sample size, indicative of overall predictive performance. Accuracies are greater than 80% with F1 scores above 0.80; MCC values are generally above 0.60, with the exception of the Lewis mechanism prompts (as previously discussed). A lower rate of false negatives is observed, with comparably higher rates of false positives. These results suggest that the number of computer-classified correct-use responses may be slightly inflated for this large corpus: for example, in a class of 200 students, if the model predicts 175 correct-use classifications, the actual value may be slightly lower because false positives outnumber false negatives. Overall, the predictive model performs well for each of the prompt types.
Table 5 Model performance metrics for the remaining split-validation set

Prompt type | N | κ | Accuracy (%) | F1 | MCC | TP (%) | FN (%) | TN (%) | FP (%)
---|---|---|---|---|---|---|---|---|---
Aqueous proton transfer | 3146 | 0.62 | 84.01 | 0.89 | 0.65 | 85.71 | 14.29 | 79.46 | 20.54 |
Lewis mechanism | 100 | 0.45 | 90.00 | 0.94 | 0.46 | 96.59 | 3.41 | 41.67 | 58.33 |
Why acid? | 990 | 0.73 | 86.67 | 0.88 | 0.73 | 91.59 | 8.41 | 80.59 | 19.41 |
Why base? | 1005 | 0.59 | 79.60 | 0.82 | 0.60 | 90.68 | 9.32 | 67.43 | 32.57 |
Why amphoteric? | 1184 | 0.75 | 87.67 | 0.87 | 0.76 | 90.47 | 9.53 | 85.19 | 14.81 |
Overall | 6425 | 0.67 | 84.50 | 0.88 | 0.67 | 88.07 | 11.93 | 78.59 | 21.41 |
The external validation set (Table 6) allows us to evaluate the performance of the predictive model on new data that include a variety of new prompts and a new prompt type: a non-aqueous proton-transfer mechanism. When planning the collection of these external validation data, we recognized that the proton-transfer reactions from which the predictive model was developed were all aqueous; thus, we included a non-aqueous proton transfer in the external validation set to further evaluate the generalizability of our predictive model. A summary of the human classifications for the external validation set is given in Table 7. The new prompt type has an accuracy of 91.3%, an F1 score of 0.95, and an MCC of 0.73; thus, we conclude that the predictive model performs well when used to evaluate these new data. Additionally, kappa, accuracy, F1 scores, and MCC values generally increase across all prompt types, and false negative and false positive rates decrease. The high true positive rate indicates that the predictive model has high recall: it correctly classifies responses as correct use of the Lewis model out of all the responses a human classifier scored as correct. These external validation results show comparable or better metrics relative to the cross-validation and split-validations, and relative to other studies that use machine learning techniques (e.g., Dood et al., 2018, 2020a; Noyes et al., 2020; cf., Zhai et al., 2020). Additionally, the level of prediction achieved by our model exceeds the 70% accuracy recommendation for use in formative assessments (cf., Haudek et al., 2012; Nehm et al., 2012; Prevost et al., 2016) and is generally within the range of accepted measures for summative assessments (cf., Williamson et al., 2012).
Therefore, we conclude that an accurate, generalizable predictive model for correct use of the Lewis acid–base model was developed using machine learning techniques. Despite this level of accuracy, however, we reiterate that this generalized predictive model should be used only with formative assessments.
Table 6 Model performance metrics for the external validation set

Prompt type | N | κ | Accuracy (%) | F1 | MCC | TP (%) | FN (%) | TN (%) | FP (%)
---|---|---|---|---|---|---|---|---|---
Non-aqueous proton transfer | 715 | 0.74 | 91.33 | 0.95 | 0.73 | 96.01 | 3.99 | 75.46 | 24.54 |
Lewis mechanism | 716 | 0.62 | 94.41 | 0.97 | 0.61 | 97.86 | 2.14 | 58.73 | 41.27 |
Why acid? | 294 | 0.81 | 89.61 | 0.93 | 0.79 | 97.81 | 2.19 | 80.18 | 19.82 |
Why base? | 292 | 0.86 | 93.15 | 0.94 | 0.85 | 94.94 | 5.06 | 90.35 | 9.65 |
Why amphoteric? | 145 | 0.88 | 93.79 | 0.93 | 0.88 | 96.92 | 3.08 | 91.25 | 8.75 |
Overall | 2162 | 0.80 | 92.74 | 0.95 | 0.78 | 96.87 | 3.13 | 80.04 | 19.96 |
Table 7 Human classifications for the external validation set

Prompt type | N | Correct use, n (%) | Incorrect use/non-use, n (%)
---|---|---|---
Non-aqueous proton transfer | 715 | 552 (77) | 163 (23) |
Lewis mechanism | 716 | 653 (91) | 63 (9) |
Why acid? | 294 | 183 (62) | 111 (38) |
Why base? | 292 | 178 (61) | 114 (39) |
Why amphoteric? | 145 | 65 (45) | 80 (55) |
Overall | 2162 | 1631 (75) | 531 (25) |
There are several reasons why a response could be a false positive. First, a student may use the term “electrons” while identifying or discussing irrelevant features of a compound or reaction, for example, when determining the number of valence electrons or a formal charge:
Cyclohexanaminium can act as an acid because it can donate a proton and be left with a lone pair and 2 bonds. This would neutralize its charge since, 2 electrons + 3 bonds = 5, 5 valence electrons − 5 = 0.
This specific example demonstrates that instances of terms such as “electrons”, “bonds”, “lone”, and “pair” (terminology associated with the Lewis model) can lead the computer to predict correct Lewis acid–base model use.
Second, students may mention “a lone pair” or “lone pairs” without an action verb, usually in a surface-level description of a compound. For example,
Nitrogen has a lone pair and it is the weakest amine.
In this specific instance, the response notes a structural feature of the compound that is associated with the Lewis acid–base model; however, the answer as a whole is insufficient to be classified as correct use.
Additionally, responses may contain a broad description of the movement of electrons or the arrow-pushing formalism without any specific details. For example,
The reactants of the mechanism are undergoing a reaction in which the bonds are broke and then reformed to create the two products on the right. The curved arrows represent where the exchange/attraction of the electrons will be moving to create the bonds. This reaction occurs because the reactants are less stable then [sic] what they would be as the products.
We note that shorter responses with minimal chemical terminology, as well as longer responses without specific details, may trigger our predictive model to give a false positive because the model is simply analyzing term frequencies. For example,
(CH3)2CHO− can act as a base because it can donate electrons.
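A naive term-frequency tally makes this failure mode concrete. The sketch below is our own illustration (the study's model is an SVM over many features, not this keyword count); it shows how brief or surface-level responses containing Lewis-associated terms can look "correct" to a purely frequency-based scorer:

```python
# Illustrative only: a keyword tally, not the study's actual SVM scoring model.
LEWIS_TERMS = {"electron", "electrons", "lone", "pair", "pairs",
               "donate", "accept", "bond", "bonds"}

def lewis_term_count(response: str) -> int:
    """Count occurrences of Lewis-associated terms in a response."""
    words = response.lower().replace(",", " ").replace(".", " ").split()
    return sum(1 for w in words if w in LEWIS_TERMS)

short_response = "(CH3)2CHO- can act as a base because it can donate electrons."
surface_response = "Nitrogen has a lone pair and it is the weakest amine."

print(lewis_term_count(short_response))    # 2: "donate", "electrons"
print(lewis_term_count(surface_response))  # 2: "lone", "pair"
```

Both responses register the same keyword signal, even though neither constitutes a complete, correct use of the Lewis model; richer features or human review are needed to separate them.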
At the other end of response length, term frequency can also play a role when a student uses both the Brønsted–Lowry and Lewis models together in a response. For example,
An amphoteric substance means it can act as both an acid and a base. Water is the most common example of this. tert-Butanol can also act as both an acid or a base. This is because the hydrogen atom attached to the oxygen can be donated making it a Brønsted–Lowry acid. However, the lone pairs on the oxygen can accept a proton (H+ ion) making another bond making it a Brønsted–Lowry base.
In this instance, the greater attention given to the Brønsted–Lowry model, including explicit naming of the model, gave rise to a false negative for this response. We also found instances of this in responses to the mechanism prompts:
Part A: negatively charged ethanethiolate transfers electrons to the hydrogen atom of benzoic acid to form a single covalent bond. The electrons from the sigma bond between hydrogen and oxygen in benzoic acid gather around the oxygen atom, causing it to go from a neutral to a negative formal charge. In this way, enthanethiol [sic] is formed which has a neutral formal charge and negatively charged benzoate is produced. Part B: this reaction occurs because according to the Brønsted–Lowry definition, ethanethiolate is a base or a proton acceptor while benzoic acid is an acid or a proton donor. When ehtanetholate [sic] accepts the H atom or proton it becomes the conjugate acid, ethanethiol, and when benzoic acid gives up its proton it becomes the conjugate base, benzoate.
The first molecule in the reactant side [methoxide] is serving as a base because it is accepting a proton, while the second molecule in the reactant side [propaninium] is acting as an acid because it is donating a proton. Since the second molecule is donating its H to the first molecule, the single bond transfers as a lone pair to nitrogen. On the product side, the first molecule is the conjugate acid while the second molecule is the conjugate base. On a molecular level, the negative charge means that the atom wants to form a bond while the positive charge wants to donate hydrogen.
This response heavily invokes the Brønsted–Lowry model, with concepts of accepting and donating protons as well as conjugate acids and bases. One human classifier in the interrater discussion classified this response as incorrect use/non-use due to the lack of Lewis model use and the extensive reliance on the Brønsted–Lowry model. The other human classifier argued that, while the response does focus on the Brønsted–Lowry model, aspects within it (“the single bond transfers as a lone pair”) demonstrate correct use of the Lewis model. The response was ultimately classified as correct use, and the predictive model likewise classified it as correct use.
Prior research has indicated that students hold unclear relationships between acid–base models in their mental models (Schmidt and Volke, 2003; Drechsler and Schmidt, 2005; Bhattacharyya, 2006). Additionally, students have struggled to incorporate broader models, such as the Lewis model, into more specific models, such as the Brønsted–Lowry model (Cartrette and Mayo, 2011). While use of the Lewis model has been shown to increase student performance (Dood et al., 2018), it is unclear from responses alone whether students who use mixed models understand when each model is appropriate, or whether they default to the most specific model (e.g., Brønsted–Lowry) that can explain a phenomenon over the broader model (e.g., Lewis). The scoring model developed in our research only predicts whether a student has correctly used the Lewis acid–base model in a written response; it is limited in assessing whether students can correctly differentiate between mixed models.
Written formative assessments can help students develop skills in explaining how and why reactions occur. Numerous studies have advocated for students to explain how and why in order to enrich their productive ideas about how reactions work (e.g., Becker et al., 2016; Cooper et al., 2016; Dood et al., 2018, 2020a; Crandell et al., 2019, 2020). Research has shown that targeted formative feedback informs students about their competency level, offers suggestions for improvement, and may positively affect students’ exam scores (Hattie and Timperley, 2007; Hedtrich and Graulich, 2018; Young et al., 2020). Written assessments can reveal students’ understanding, or lack thereof; such assessments therefore offer instructors a means to prompt students to think more deeply about scientific explanations.
The predictive model developed in this study is a practical, quick, and efficient way to formatively evaluate student understanding of the Lewis acid–base model. We have made freely available the files, along with a set of instructions, necessary for instructors to conduct their own analyses (cf., Yik and Raker, 2021); the files are to be used with R, a free statistical software environment (R Core Team, 2019). Written formative assessments can be easily scored using computer-scoring models, like the one developed in this study, to support just-in-time teaching in large-enrollment courses (Novak et al., 1999; Prevost et al., 2013; Urban-Lurain et al., 2013; Prevost et al., 2016). For example, instructors may administer constructed-response questions as homework assignments (with low point value for completion credit) and then utilize the predictive model to classify student responses. Quantitative results are nearly instantaneous, providing quick feedback to instructors and students; the R program outputs results in a spreadsheet of paired student predictive scores that can be uploaded into learning management systems with little modification. If instructors intend to hand score responses in addition to using the predictive model, we suggest that hand scoring be conducted first to avoid anchoring bias (Sherif et al., 1958). In the classroom, student responses can be used to create clicker questions and/or as a starting point for classroom discussion. Instructors can also use quantitative results to reshape lessons or homework activities to promote correct understanding and use of the Lewis acid–base model.
Information provided by the predictive model can allow instructors to provide additional resources to students to support learning. One method is to couple predictive models with corresponding topic-specific online tutorials; such tutorials have been shown to increase student understanding and achievement in organic chemistry courses (e.g., O’Sullivan and Hargaden, 2014; Richards-Babb et al., 2015; Dood et al., 2019, 2020b). Tutorial-based learning interventions can be utilized to facilitate better construction of explanations when paired with adaptive learning opportunities based on quick results from computer-assisted scoring. Furthermore, online learning tools can supplement learning. One such open educational resource tool is OrgChem101 (https://orgchem101.com), which contains modules on acid–base reactions, nomenclature, and organic mechanisms; the latter two modules have been shown to increase students’ learning gains (Bodé et al., 2016; Carle et al., 2020).
There is ample opportunity to evaluate student understanding of acids and bases in the postsecondary chemistry curriculum using predictive models. For example, in our study, we primarily evaluated first-semester organic chemistry students’ understanding of acid–base reaction mechanisms and why compounds can act as an acid, a base, or be amphoteric using the Lewis model. As acid–base models are generally first introduced in general chemistry (Paik, 2015), further research can evaluate the effectiveness of our predictive model in this setting. Additionally, our predictive model can be evaluated with students in other organic chemistry courses; Dood et al. (2019) found that students benefited from a tutorial reviewing Lewis acid–base concepts at the beginning of the second-semester organic chemistry course. While the Brønsted–Lowry and Lewis models are considered in postsecondary general chemistry and organic chemistry courses, other models to describe acids and bases are introduced in upper-level courses (e.g., Raker et al., 2015). In inorganic chemistry, the Lux–Flood model is based on oxide ion donors and acceptors (Lux, 1939; Flood and Förland, 1947), the Usanovich model defines acids and bases as charged species donors and acceptors (Finston and Rychtman, 1982), and the concept of hard and soft acids and bases (HSABs) describes polarizable acids and bases as soft and nonpolarizable acids and bases as hard (Pearson, 1963). While each of these models has its own advantages and criticisms (Miessler et al., 2014), it may be beneficial for students to preferentially use and understand one of these models in particular scenarios; there are thus avenues to build predictive models for other acid–base contexts.
Predictive models have the potential to be a practical way to change the nature of formatively assessing student understanding. Researchers have studied reasoning in chemistry, for example: teleological (e.g., Wright, 1972; Talanquer, 2007; Caspari et al., 2018b), mechanistic (e.g., Bhattacharyya and Bodner, 2005; Ferguson and Bodner, 2008; Bhattacharyya, 2013; Galloway et al., 2017; Caspari et al., 2018a), causal (e.g., Ferguson and Bodner, 2008; Crandell et al., 2019, 2020), and causal mechanistic (e.g., Becker et al., 2016; Cooper et al., 2016; Bodé et al., 2019; Noyes and Cooper, 2019; Crandell et al., 2020). Studies have also been conducted on student reasoning through case comparisons between reaction mechanisms (e.g., Graulich and Schween, 2018; Bodé et al., 2019; Caspari and Graulich, 2019; Watts et al., 2021). Student understanding via argumentation has likewise been investigated (e.g., Moon et al., 2016, 2017, 2019; Pabuccu, 2019; Towns et al., 2019). Regardless of the mode of reasoning, these different routes offer researchers comparative and contrasting means to study student understanding. While predictive models have begun to consider classification or levels of reasoning (Haudek et al., 2012; Prevost et al., 2012; Dood et al., 2018, 2019, 2020a; Noyes et al., 2020), the question remains: can generalized predictive models be built for other chemistry concepts? The development of future predictive models has the potential to expand the work communicated in this study by incorporating levels of reasoning and additional analyses that instructors can use to further support student learning.
Part A: describe in full what you think is happening on the molecular level for this reaction. Be sure to discuss the role of each reactant and intermediate.
Part B: using a molecular level explanation, explain why this reaction occurs. Be sure to discuss why the reactants form the products shown.
[pt-HCl]: consider the mechanism for the following acid–base reaction between water and hydrochloric acid to form hydronium and chloride ion.
[pt-HBr]: consider the mechanism for the following acid–base reaction between water and hydrobromic acid to form hydronium and bromide ion.
[pt-HI]: consider the mechanism for the following acid–base reaction between water and hydroiodic acid to form hydronium and iodide ion.
[lew-carbocation]: consider the mechanism for the reaction between ethanol and 2-methyl-2-propanylium to form tert-butyl(ethyl)oxonium.
[lew-BMe3]: consider the mechanism for the reaction between bromide and trimethylborane to form bromotrimethylborate.
[lew-BF3]: consider the mechanism for the reaction between ammonia and trifluoroborane to form the ammonia–trifluoroborane adduct.
[lew-AlCl3]: consider the mechanism for the reaction between acetone and aluminum trichloride to form the acetone–aluminum trichloride adduct.
[ampho-EtOH]: explain why ethanol, CH3CH2OH, can act as both an acid and a base.
[ampho-IPA]: explain why isopropanol, (CH3)2CHOH, can act as both an acid and a base.
[acid-AlCl3]: explain why aluminum trichloride, AlCl3, can act as an acid.
[acid-BH3]: explain why borane, BH3, can act as an acid.
[acid-carbocation]: explain why methylium, CH3+, can act as an acid.
[base-py]: explain why pyridine, C5H5N, can act as a base.
[base-NEt3]: explain why triethylamine, N(CH2CH3)3, can act as a base.
[base-PPh3]: explain why triphenylphosphine, P(C6H5)3, can act as a base.
Part A: describe in full what you think is happening on the molecular level for this reaction. Be sure to discuss the role of each reactant and intermediate.
Part B: using a molecular level explanation, explain why this reaction occurs. Be sure to discuss why the reactants form the products shown.
[pt-acetone]: consider the mechanism for the reaction between diisopropylamide and acetone to form diisopropylamine and acetone enolate.
[pt-BzOH]: consider the mechanism for the reaction between ethanethiolate and benzoic acid to form ethanethiol and benzoate.
[pt-ammonium]: consider the mechanism for the reaction between methoxide and propanaminium to form methanol and propanamine.
[lew-BCl3]: consider the mechanism for the reaction between imidazole and trichloroborane to form the imidazole–trichloroborane adduct.
[lew-AcCl]: consider the mechanism for the reaction between ethanamine and acetyl chloride to form chloro(ethylammonio)ethanolate.
[lew-CO2]: consider the mechanism for the reaction between water and carbon dioxide to form carbonic acid.
[acid-BF3]: explain why trifluoroborane, BF3, can act as an acid.
[acid-ammonium]: explain why cyclohexanaminium, (C6H11)NH3+, can act as an acid.
[base-HNEt2]: explain why diethylamine, (CH3CH2)2NH, can act as a base.
[base-propoxide]: explain why isopropoxide, (CH3)2CHO−, can act as a base.
[ampho-tBuOH]: explain why tert-butanol, (CH3)3COH, can act as both an acid and a base.
This journal is © The Royal Society of Chemistry 2021