Paul P. Martin and Nicole Graulich*
Justus-Liebig-University Giessen, Institute of Chemistry Education, Heinrich-Buff-Ring 17, 35392 Giessen, Germany. E-mail: Nicole.Graulich@didaktik.chemie.uni-giessen.de
First published on 30th January 2023
In chemistry, reasoning about the underlying mechanisms of observed phenomena lies at the core of scientific practices. The process of uncovering, analyzing, and interpreting mechanisms for explanations and predictions requires a specific kind of reasoning: mechanistic reasoning. Several frameworks have already been developed that capture the aspects of mechanistic reasoning to support its formative assessment. However, evaluating mechanistic reasoning in students’ open responses is a time- and resource-intensive, complex, and challenging task when performed by hand. Emerging technologies like machine learning (ML) can automate and advance the formative assessment of mechanistic reasoning. Due to its usefulness, ML has already been applied to assess mechanistic reasoning in several research projects. This review focuses on 20 studies in chemistry education research that apply ML to capture mechanistic reasoning. We developed a six-category framework based on the evidence-centered design (ECD) approach to evaluate these studies in terms of pedagogical purpose, rubric design, construct assessment, validation approaches, prompt structure, and sample heterogeneity. Contemporary effective practices of ML-based formative assessment of mechanistic reasoning in chemistry education are emphasized to guide future projects and to help overcome challenges. Ultimately, we conclude that ML has advanced replicating, automating, and scaling human scoring, while it has not yet transformed the quality of evidence drawn from formative assessments.
However, students will only construct such explanations and develop the desired, cognitively demanding reasoning skills if these skills are regularly assessed (Stowe and Cooper, 2017; Stowe et al., 2021; DeGlopper et al., 2022). Traditional chemistry assessment tends to emphasize rote learning strategies, such as recall, application of simple algorithms, and pattern recognition (Stowe and Cooper, 2017). For this reason, students may rely on memorization in chemistry classes (Grove and Lowery Bretz, 2012). Accordingly, instructors need tools for formative assessment that capture the quality of mechanistic reasoning. Such tools can be assessment items that allow students to explain how and why mechanisms occur (Cooper et al., 2016) so that instructors can regularly assess and foster the desired reasoning skills.
Developing high-quality science assessments that elicit an extensive set of competencies as well as content knowledge and give adequate feedback is challenging in daily teaching practices (Songer and Ruiz-Primo, 2012; Pellegrino, 2013; Pellegrino et al., 2016). Creating valid and reliable tools for drawing evidence-based inferences about mechanistic reasoning and individual support needs is particularly challenging. So far, closed-response questions, for example, single- or multiple-choice items, have often been the standard type of assessment in large classrooms. These items can be evaluated quickly, but the forced choice comes with validity threats (Birenbaum and Tatsuoka, 1987; Kuechler and Simkin, 2010; Lee et al., 2011). Students may just guess the correct answer, preventing future learning progression. Additionally, closed-response questions cannot assess students’ normative scientific and naïve ideas as accurately as human- or computer-scored open-ended items (Beggrow et al., 2014).
Hence, science education research (e.g., Haudek et al., 2019) calls for the application of open-ended assessments such as constructed responses, essays, simulations, educational games, and interdisciplinary assessments since these formats may promote detailed explanations of chemical phenomena. Until now, resource constraints like workload, costs, and time for human evaluation have often prevented teachers and faculty members from using such formative assessments regularly in large-enrollment courses. Machine learning (ML) can be used to increase the application of formative open-ended assessments (cf., Zhai et al., 2020a, 2020c) because it offers new approaches for capturing student understanding, facilitating immediate decision-making as well as action-taking (Zhai, 2021). Notably, ML eases human effort in scoring, improves approaches to evidence collection, and has great potential in tapping manifold constructs (Zhai, 2019), which enables more reasoning-focused assessment and teaching (for more information on ML terms, see the glossary in Appendix).
The primary goal of this review is to illustrate the advancements and shortcomings that ML has brought to the formative assessment of mechanistic reasoning. For that purpose, we discuss six interdependent categories: pedagogical purpose of ML application, rubric design, construct assessment, validation approaches, prompt structure, and sample heterogeneity. We emphasize the importance of these categories for designing and implementing formative assessments and provide an in-depth analysis of how mechanistic reasoning is captured in ML-based chemistry assessments. Three questions guide the objectives of this review: First, what contemporary effective practices can be deduced from current research projects? Second, how did the selected research projects already advance the field of ML-based formative chemistry assessment capturing mechanistic reasoning? Third, what are the shortcomings in applying ML to formative chemistry assessment capturing mechanistic reasoning? After analyzing the selected studies, we suggest implications for the future implementation of ML in formative assessments to extend assessment functionality.
Influenced by these considerations, three highly interrelated spaces are highlighted in the ECD (cf., Fig. 1): the claim, evidence, and task space (Pellegrino et al., 2016; Kubsch et al., 2022b). The claim space addresses the objectives of an assessment, i.e., the competencies that should be evaluated through the assessment (Messick, 1994). Consequently, before designing an assessment, one needs to analyze the targeted cognitive domain to make a precise claim about students’ competencies within this domain (Rupp et al., 2012; Mislevy, 2016; Pellegrino et al., 2016; Kubsch et al., 2022b). Making a claim includes unpacking the fine-grained, domain-specific characteristics, analyzing their connections, specifying how they contribute to the acquisition of essential competencies, and arranging all competencies in an order of increasing complexity (Pellegrino et al., 2016). In the context of mechanistic reasoning, several frameworks have been developed that characterize the specificities of this complex type of reasoning (Machamer et al., 2000; Russ et al., 2008; Kraft et al., 2010; Sevian and Talanquer, 2014; Cooper et al., 2016; Caspari et al., 2018a; Krist et al., 2019; Dood et al., 2020a; Raker et al., 2023; Yik et al., 2023). Although all these frameworks focus on the dynamic interplay of entities and activities a scalar level below the phenomenon, they differ in how they conceptualize mechanistic reasoning and in the elaborateness they capture (cf., Dood and Watts, 2022a). Most frameworks differentiate between forms of reasoning that capture, among others, descriptive, causal, and mechanistic reasoning (Machamer et al., 2000; Russ et al., 2008; Cooper et al., 2016; Caspari et al., 2018a). Descriptive reasoning entails identifying the entities of a mechanism as well as their explicit properties without necessarily outlining the cause of the underlying processes. Such reasoning is more focused on describing the system as a whole (Sevian and Talanquer, 2014; Dood et al., 2020a; Raker et al., 2023; Yik et al., 2023). Causal reasoning involves understanding cause-and-effect relationships between different variables and why they contribute to the outcome of a chemical phenomenon. Causal reasoning is, therefore, concerned with understanding the principles or factors that drive a reaction, rather than the specific steps or processes involved (diSessa, 1993; Carey, 1995; Russ et al., 2008; Cooper et al., 2016). Mechanistic reasoning focuses on understanding how a chemical reaction proceeds at the molecular level, including the mechanistic steps that convert the reactants into the products. Mechanistic reasoning thus aims to explain, at the level of electrons, how the underlying processes occur (Machamer et al., 2000; Russ et al., 2008; Cooper et al., 2016; Caspari et al., 2018a, 2018b; Krist et al., 2019).
After reflecting on the claim space, evidence statements that outline the performance accepted as evidence must be composed (Mislevy, 2016; Pellegrino et al., 2016; Kubsch et al., 2022b). Formulating evidence statements involves specifying the desired competencies in detail and setting a framework for the interpretation of this evidence (Kubsch et al., 2022b). Eliciting evidence of mechanistic reasoning is a challenging, iterative procedure that requires multiple refinements of the task model and a repeated evaluation of students’ responses (Cooper et al., 2016; Noyes et al., 2022). In particular, a coding scheme should be developed based on expected and actual answers to outline what a high level of reasoning sophistication could look like (Cooper, 2015). Here, evidence rules (Mislevy et al., 2003a; Rupp et al., 2012) define how observable aspects, for instance, written phrases in students’ work products about a mechanism's plausibility, contribute to students’ competence in mechanistic reasoning. Measurement models, in turn, specify the method for making diagnostic inferences about the examined construct (Rupp et al., 2012), e.g., how phrases in students’ written accounts are coded as evidence of mechanistic reasoning.
Coherently defining evidence rules and measurement models helps develop a construct-centered task model that characterizes the form of tasks needed to collect evidence about students’ competencies (Pellegrino et al., 2016). Rather than representing a single task, task models characterize the central features that a set of potential tasks must possess as well as the elements that could be varied across different assessment contexts (Mislevy et al., 2003a; Mislevy and Haertel, 2007; Kubsch et al., 2022b). Besides defining these contexts, a task model for capturing mechanistic reasoning must balance initiating as much sophisticated reasoning as possible against providing too much information on the structure and content of a high-quality response (Noyes et al., 2022). Structuring prompts adequately and specifying the degree of scaffolding is, thus, key in developing task models (Caspari and Graulich, 2019; Graulich and Caspari, 2020; Noyes et al., 2022), for instance, ones that guide students in judging the plausibility of a reaction mechanism (Lieber and Graulich, 2020, 2022; Lieber et al., 2022a, 2022b; Watts et al., 2022). Consistently connecting claim, evidence, and task space ultimately enables the design and use of formative assessments to be better aligned with the construct of interest.
Compared to previous methods, ML has great potential in automatically evaluating mechanistic reasoning since these algorithms can learn from previous experience without explicit programming (Samuel, 1959). For detecting mechanistic reasoning with supervised techniques, human raters have to consistently label evidence of mechanistic reasoning in students’ responses according to a literature-based or inductively derived coding scheme (Raker et al., 2023). Afterward, the selected algorithm can be trained with the labeled data so that it can decipher the underlying patterns. With this approach, it is not necessary to explicitly program an algorithm that, for example, distinguishes between a chain of phenomenological links and mechanistic reasoning. The level of causality, i.e., different types of reasoning, can simply be determined by training the algorithm with coherently labeled data and letting the algorithm learn from these patterns (Dood et al., 2018, 2020a; Noyes et al., 2020). In doing so, even subtleties in mechanistic reasoning can be identified if heterogeneous and coherently labeled training data is available (Watts et al., 2023).
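To make this supervised workflow concrete, the following minimal sketch trains a simple text classifier on a handful of invented, human-labeled responses; the responses, labels, and the choice of a TF-IDF/logistic-regression pipeline are our own assumptions and not taken from any reviewed study.

```python
# Minimal sketch of supervised scoring: an algorithm learns reasoning types
# from human-labeled responses instead of explicitly programmed rules.
# (Invented data; any text classifier could stand in here.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

responses = [
    "The hydroxide attacks the carbon because it is electron poor.",
    "First the bromide leaves, then water adds to the carbocation.",
    "The product forms because the reaction takes place.",
]
labels = ["causal", "mechanistic", "descriptive"]  # human-assigned codes

# Train on the labeled examples; the model infers which word patterns
# co-occur with each reasoning type.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(responses, labels)

# Unlabeled responses can then be scored automatically.
print(model.predict(["The leaving group departs and a carbocation forms."]))
```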
Another framework conceptualizing the implementation of ML in science assessment in general, developed by Zhai et al. (2020a), considers the dimensions of construct, automaticity, and functionality. Some cross-references between the ML-adapted ECD and Zhai et al.'s (2020a) framework can be made (cf., Fig. 2). Zhai et al.'s (2020a) dimension construct is covered by our categories rubric design and prompt structure. For automaticity, Zhai et al. (2020a) introduced, among others, the variables machine extracts attributes and model generalization. Machine extracts attributes is reflected in our category construct assessment because both deal with the fundamental methodology of an ML algorithm. Zhai et al.'s (2020a) model generalization analyzes whether an ML algorithm trained with one dataset can be applied to another dataset. Building upon this distinction, we used the category sample heterogeneity. For functionality, Zhai et al. (2020a) established, among others, the variables score use and measurement models. The variable score use evaluates whether the ML output is embedded in a learning activity, which we highlight in pedagogical purpose of ML application. Lastly, Zhai et al.'s (2020a) variable measurement models corresponds to our category validation approaches since both refer to the method used for the validation of an algorithm.
Fig. 2 Mapping the six categories of the machine learning-adapted evidence-centered design to the three dimensions of Zhai et al.'s (2020a) framework.
To exemplify what we mean by a well-planned pedagogical purpose of ML application in terms of adaptive learning, we delineate the studies of Zhu et al. (2017), Mao et al. (2018), and Lee et al. (2021). The authors investigated students’ scientific argumentation skills with a focus on the linkage between claim and evidence, the limitations of the given data, and the factors that cause uncertainty. In doing so, they developed an ML-based instructional system that automatically provided individualized real-time feedback so that students had the chance to revise their arguments accordingly. Zhu et al. (2017) found that most students revised their arguments about factors that affect climate change after receiving adapted feedback. Consequently, students who revised their arguments had significantly higher final scores than those who did not adjust their arguments.
Lee et al. (2021) built upon the work of Zhu et al. (2017) and Mao et al. (2018) by applying ML in three simulation-based scientific argumentation tasks. In addition to argumentation feedback, Lee et al. (2021) provided ML-enabled simulation feedback to improve students’ interactions with simulations about groundwater systems. Students who received adapted simulation feedback were more likely to re-run the simulation, leading to significantly higher scores in some tasks. This result shows that ML algorithms can provide effective individualized feedback not only for written arguments but also for simulation interactions. So, ML can be used to ease the implementation of adaptive learning.
In their studies on scientific argumentation, Jescovitch et al. (2021) and Wang et al. (2021) compared a multi-level, holistic coding approach with a dichotomous, analytic coding approach. Both found that ML models trained with analytic bins performed slightly better than those trained with a holistic rubric, especially for items that introduced a complex scenario. They inferred that analytic rubrics might have the potential to identify and unpack additional complexity in students’ responses, which can be explained by the reduced human coding effort, resulting in higher human–human inter-rater reliability and improved model performance (Jescovitch et al., 2019a, 2021). However, there is no evidence that one coding approach outperforms the other in every context for human and machine coding (Zhai et al., 2020c). For this reason, more important than the rubric type is that rubrics allow for consistent classifications of data, both within and between human coders (Raker et al., 2023).
For the application of ML, fine-grained rubrics seem to offer deeper insights into student reasoning. Therefore, we classify binary rubrics as level 1, multi-level rubrics as level 2, and multi-level rubrics representing different levels of reasoning sophistication as level 3 in the ML-adapted ECD. Rubrics classified as level 2, in general, differentiate between incorrect, partially correct, and correct responses, whereas rubrics classified as level 3 distinguish between different levels of reasoning sophistication. In all cases, a valid rubric and an iterative process of its application and refinement are necessary to elicit the construct of interest (Allen and Tanner, 2006).
For example, lexical analysis software can automatically extract domain-specific categories that are iteratively refined by human experts afterward (Haudek et al., 2011). Defining meaningful categories can, consequently, be a major task in applying lexical analysis (Urban-Lurain et al., 2010). Categories should offer enough information to detect different sophistication levels; however, infrequently used categories have less predictive power. With statistical methods, it is, thus, possible to identify meaningful categories and collapse non-discriminatory ones. Sometimes, categories are formed based on human- or computer-set if–then commands, which come with limitations. On the one hand, case-specific custom libraries need to be defined to decide which keywords are included in the if–then commands and how fine-grained the categories should be, which is labor- and time-intensive. On the other hand, if–then commands do not allow for flexible adjustments of the categories over time. So, we define construct assessment based on if–then commands as the first level in this category.
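For illustration, the following sketch implements such an if–then categorization with a small, invented custom library; the category names and keywords are hypothetical, and commercial lexical-analysis software relies on far richer, expert-curated term libraries.

```python
# Minimal sketch of category extraction via if-then keyword rules.
# (Hypothetical custom library; categories and terms are invented.)
CUSTOM_LIBRARY = {
    "strong_acid": {"hcl", "strong acid", "fully dissociates"},
    "proton_transfer": {"proton", "h+", "donates a hydrogen"},
    "electron_pair": {"lone pair", "electron pair", "accepts electrons"},
}

def categorize(response):
    """Return every category whose keywords appear in the response."""
    text = response.lower()
    return {category for category, terms in CUSTOM_LIBRARY.items()
            if any(term in text for term in terms)}

print(sorted(categorize("HCl fully dissociates and donates a hydrogen to water.")))
# ['proton_transfer', 'strong_acid']
```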
Rather than just running human- or computer-defined commands, ML models can also gain experience on their own (Bishop, 2006). Once the algorithm has learned from the training data, it can classify or predict the characteristics of new data (cf., Appendix). For such ML algorithms, features can, for instance, be extracted with n-grams. These n-grams are sequences of n words that allow for identifying repetitive patterns, i.e., n consecutive words in students’ responses. Besides word n-grams, ML algorithms can also process character n-grams, response length, syntactic dependencies, and semantic role labels (e.g., Mao et al., 2018). These methods require less human effort, especially when a reliable scoring rubric has already been applied to many student responses (Haudek et al., 2011). However, the mentioned natural language processing techniques are built on the simplified assumption that the word order is irrelevant to the meaning of a sentence (Wulff et al., 2022b), which complicates the detection of implicit semantic embeddings. So, traditional ML models are only sensitive to key conceptual components, which is why we define construct assessment based on shallow learning experiences as the second level of the ML-adapted ECD. Here, the term experience refers to the algorithm's process of collecting information from pre-processed data to make automated decisions.
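The sketch below illustrates typical shallow features, i.e., word n-grams, character n-grams, and response length, extracted with scikit-learn from two invented responses; the concrete feature choices vary between studies.

```python
# Minimal sketch of shallow feature extraction for ML scoring.
# (Invented responses; feature choices differ between studies.)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "the bromide leaves because the carbocation is stabilized",
    "the reaction just happens and gives the product",
]

# Word unigrams and bigrams: counts of single words and adjacent word pairs.
word_ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(responses)

# Character trigrams: somewhat robust to misspellings such as "carbocatoin".
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(responses)

# Response length as an additional numeric feature.
lengths = np.array([[len(r.split())] for r in responses])

print(word_ngrams.shape, char_ngrams.shape, lengths.ravel())
```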
Other techniques like Bidirectional Encoder Representations from Transformers (BERT), a natural language processing technique for calculating embedding vectors, may assess the relationships between words more comprehensively since it uses contextualized embeddings (Wulff et al., 2022a). In other words, BERT is capable of grasping the implicit meaning of individual words, understands the dynamic contextualized relationships of a word with any other word in a sentence, and recognizes filler words. Since BERT processes all words of a sentence simultaneously, it can detect both conceptual and semantic features, which is why we consider construct assessment based on contextualized learning experiences as the third level in this category.
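As a rough illustration of this approach, the sketch below computes contextualized token embeddings with a pre-trained, general-purpose BERT model from the Hugging Face transformers library; the reviewed studies fine-tuned task-specific models, so this is only a minimal starting point with invented sentences.

```python
# Minimal sketch: contextualized embeddings from a pre-trained BERT model.
# (Illustrative only; invented sentences, no fine-tuning.)
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "the negative charge on the oxygen attacks the carbon",
    "the formal charge of the molecule is negative",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each token receives a vector that depends on its whole sentence, so the
# word "charge" is represented differently in the two contexts.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (2 sentences, sequence length, 768 dimensions)
```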
Three methods can be applied to calculate the machine-human score agreement: self-validation, split-validation, and cross-validation (cf., Appendix). Nehm et al. (2012) found that their machine-human score agreement slightly decreased when validating their algorithms with a new testing set compared to the agreement reached when using the same dataset for model training and validation. In their systematic review, Zhai et al. (2020c) underpinned this result by finding that self-validation yields better machine-human score agreements than split- and cross-validation. However, using self-validation brings equity and generalizability constraints since the same data is used for the training and evaluation of the algorithm (Zhai et al., 2020a, 2020c). For this reason, we classify self-validation approaches as the first and split- and cross-validation approaches as the second level in the ML-adapted ECD. For educational purposes, the transparency of ML-based decisions should, additionally, not be underestimated (Cheuk, 2021). Besides calculating the level of agreement, it is helpful to understand why the respective model decisions were made by comparing human- and computer-set clusters, interpreting misclassifications, or conducting interviews. Analyzing ML model decisions is, therefore, the third level in this category.
Irrespective of the validation approach, various metrics are used to measure the magnitude of machine-human score agreement (cf., Zhai et al., 2020c). For a more consistent indication of machine-human score agreement, Zhai et al. (2021) call for the introduction of standardized rules, e.g., to report a confusion matrix, accuracy, Cohen's κ, and the relevant variance.
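As a minimal sketch, the following lines show how such agreement metrics can be reported for a set of responses; the scores are invented, and the 0 to 2 scale stands for an assumed three-level rubric.

```python
# Minimal sketch: reporting machine-human agreement with a confusion matrix,
# accuracy, and Cohen's kappa. (Invented scores; in split-validation they
# would come from a held-out test set, in cross-validation from rotating folds.)
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Human and machine scores for the same responses
# (0 = descriptive, 1 = causal, 2 = mechanistic).
human   = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
machine = [0, 1, 2, 1, 1, 0, 2, 2, 0, 2]

print(confusion_matrix(human, machine))
print("accuracy:", accuracy_score(human, machine))
print("Cohen's kappa:", round(cohen_kappa_score(human, machine), 2))
```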
During the prompt refinement process, one has to consider not only the wording but also the structure of a prompt in terms of scaffolding (e.g., Kang et al., 2014). Scaffolding, for example, combined with contrasting cases (Graulich and Schween, 2018), helps focus students’ attention on productive concepts (Wood et al., 1976) and activate resources that might otherwise be overlooked (Hammer, 2000). When adjusting the scaffolding, one has to consider the resources that students have already applied in their reasoning, the parts of the task model that have activated those resources, and the modifications that can be implemented to activate further appropriate resources (Noyes et al., 2022). In ML-based science assessment, scaffolding can also be realized by providing multiple response boxes in which students can enter their answers (Urban-Lurain et al., 2010). This creation of separate text boxes has two major benefits. On the one hand, multiple boxes provide students with an easily accessible form of scaffolding since these boxes may help students structure their explanations as desired. On the other hand, multiple response boxes facilitate automated text analysis as shown by Urban-Lurain et al. (2010). In their study, a single text box was provided to investigate student thinking on acid–base chemistry. They realized that the students used similar terms in this box but with entirely different ideas in mind. Due to the similar wording, ML models could neither elicit the context to which the technical terms referred nor students’ cognitive concepts. The introduction of text boxes for different prompts helped them to overcome this limitation.
Designing a prompt to capture mechanistic reasoning is, considering all these aspects, a complex process requiring a detailed analysis of the phenomenon under investigation and of the competencies that students should acquire. Simply asking students to explain how and why phenomena occur, without providing comprehensive support, is often not sufficient to initiate sophisticated reasoning (Cooper, 2015). Scaffolded prompts can better guide students in activating their resources as the expectations of what a good answer should look like are more clearly outlined and assessable afterward. Based on the known effectiveness of scaffolded prompts, we define the levels unspecific (level 1), specific (level 2), and specific, scaffolded (level 3) in the category prompt structure. We understand by unspecific prompts questions that can be asked regardless of the subject domain, such as “Explain your reasoning”. Specific prompts provide better orientation for students on how to solve a task. Due to their potential instructional support on how to structure a response, specific, scaffolded prompts comprise the third level in this category.
Some studies on ML in science education already emphasized the importance of creating equitable ML models. Ha et al. (2011) investigated whether an ML model developed using biology students’ written explanations of evolutionary change from one institution could score responses from another institution. They found that their predictive model could accurately score responses at both universities for most key concepts; however, the model performance slightly decreased for a few concepts at the second university. The partly lower accuracy of the ML model was explained by unique language patterns, low concept frequencies, misspellings, and informal vocabulary. Liu et al. (2016) analyzed differential effects concerning gender, native language, and use of a computer for homework and found no generalizable differences between human and machine scoring across all items for any of these variables. Similar to Ha et al. (2011), Liu et al. (2016) traced minor differences in human and machine scoring back to misspellings and linguistic diversity. These findings led Ha and Nehm (2016) to examine the influence of misspelled words on machine scoring. Although students who learned English as an additional language produced significantly more misspellings, this did not meaningfully impact computer scoring efficiency. For a valid examination of these differential effects, a heterogeneous data sample is required (Cheuk, 2021).
Putting it all together, considerable heterogeneity of responses is needed to investigate the generalizability of an ML model, which means that data must cover all sophistication levels. If specific words, terms, or phrases are missing from the training data, the algorithm cannot predict them in the testing data. This prerequisite is reflected in the ML-adapted ECD (cf., Fig. 3). A low degree of heterogeneity (level 1) may produce a biased ML model, since students who meet specific external characteristics may be favored, whereas students who do not meet these characteristics may be disadvantaged. By a low degree of heterogeneity, we understand certain external characteristics that tend to be identical across the investigated sample, such as demographic background, instructor, and curricula. A high degree of heterogeneity (level 2) enlarges the training vocabulary of the ML model and increases equity by considering different demographic backgrounds, instructors, and curricula. Investigating differential effects (level 3) ultimately helps uncover and prevent bias.
We defined a set of keywords to look for studies that contributed to the automated analysis of mechanistic reasoning in chemistry. After an iterative process of refinement, the search term chemistry AND (“machine learning” OR “lexical analysis” OR “automated guidance” OR “automated feedback” OR “automated text analysis”) was established. We then used the three databases Google Scholar, ProQuest, and ERIC to find suitable articles. Due to the immense number of returned results (44 600 in total), we restricted where the search terms must occur. We specified that the keywords must occur in the title when using Google Scholar, in the abstract when using ProQuest, and anywhere in the article when using ERIC. Eventually, the search process resulted in 996 articles (333 Google Scholar, 640 ProQuest, 23 ERIC). First, we screened the titles of these papers to sort out all articles related to the field of chemistry but not to chemistry education. After that, we looked closely at the abstracts to examine whether the articles contributed to our area of interest. Additionally, we defined inclusion and exclusion criteria that the reviewed articles must comply with.
The included studies had to meet the following criteria:
• Topic of research: research studies must focus on ML's application in chemistry education research. The participants in the studies did not have to be chemistry students exclusively.
• Goal of research: research studies must contribute to the analysis of mechanistic reasoning. In this review, mechanistic reasoning generally refers to reasoning about the underlying mechanisms of observed phenomena, i.e., general or organic chemistry phenomena. For a study to be included, it was not sufficient that chemistry content or reasoning about macroscopic phenomena was addressed. Included studies must capture reasoning about underlying mechanisms, i.e., how and why phenomena occur.
• Time range of publications: since the first paper meeting the defined selection criteria was published in 2009, we considered all articles since 2009.
The following criteria led to the exclusion of a study:
• Studies concentrating on science assessments with predominantly non-chemical content are excluded (e.g., Liu et al., 2016; Zhai et al., 2022a).
• Studies analyzing inquiry or modeling skills in drawing activities are excluded (e.g., Rafferty et al., 2013, 2014; Gerard et al., 2016; Zhai et al., 2022b).
• Studies focusing on the automated analysis of inquiry skills in designing and conducting experiments are excluded (e.g., Gobert et al., 2013, 2015; Sao Pedro et al., 2013).
• Studies using ML to evaluate student-written interpretations of chemical representations are excluded (e.g., Prevost et al., 2014).
• Studies investigating the automated analysis of student-constructed graphs are excluded (e.g., Vitale et al., 2015).
• Studies using written responses to explore methodological issues of ML, e.g., the effect of construct complexity on scoring accuracy (Haudek and Zhai, 2021), are excluded.
• Studies discussing the teacher role in ML application or their pedagogical content knowledge assessed by ML are excluded (e.g., Zhai et al., 2020b).
We screened the reference section of each selected study to find other appropriate papers. The final set of articles contains 20 studies (cf., Table 1). To be noted, Urban-Lurain et al. (2009) summarized parts of Haudek et al.'s (2009) study. Prevost et al. (2012a) document the use of an automatically evaluated item, which is presented in a more extensive study by Prevost et al. (2012b). Dood et al. (2018) report the development of a predictive model for Lewis acid–base concept use, whereas Dood et al. (2019) focus on the pedagogical benefits of their model. Similarly, the pedagogical use of Dood et al.'s (2020a) computerized scoring model is reported in the study by Dood et al. (2020b). In this review, we hereafter refer, where possible, to Haudek et al. (2009), Prevost et al. (2012b), Dood et al. (2018), and Dood et al. (2020a).
| Author | Software | Method | Education level | Domain | Science practice | Item | Sample sizes |
|---|---|---|---|---|---|---|---|
| Note: Haudek et al. (2009) and Urban-Lurain et al. (2009), Prevost et al. (2012a, 2012b), Dood et al. (2018, 2019), and Dood et al. (2020a, 2020b) refer to the same predictive model, which is why they are represented within a single row. BERT = Bidirectional Encoder Representations from Transformers, CR = constructed response, SI = simulation, FD = flow diagram, WTL = writing-to-learn. | |||||||
| Haudek et al. (2009), Urban-Lurain et al. (2009) | SPSS text analytics for surveys | Category-based lexical analysis | Undergraduate | Thermodynamics, acid–base chemistry | Explanation | CR | 158/153/382 |
| Haudek et al. (2012) | SPSS text analytics for surveys | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 1172/323 |
| Prevost et al. (2012a, 2012b) | SPSS text analytics for surveys/SPSS modeler text analytics | Category-based lexical analysis | Undergraduate | Thermodynamics | Explanation | CR | 168/329/386 |
| Liu et al. (2014) | c-Rater | Natural language processing | Middle school | Thermodynamics | Explanation | CR | 412/362/321/356 |
| Donnelly et al. (2015) | c-Rater | Natural language processing | Middle school | Thermodynamics | Inquiry | CR, SI | 346 |
| Haudek et al. (2015) | SPSS modeler text analytics | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 336 |
| Vitale et al. (2016) | c-Rater-ML | Support vector regressor | Middle school | Climate change | Inquiry | CR, FD, SI | 283 |
| Tansomboon et al. (2017) | c-Rater-ML | Support vector regressor | Middle school | Thermodynamics | Inquiry | CR, SI | 482 |
| Dood et al. (2018, 2019) | SPSS modeler text analytics | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 752/1058 |
| Haudek et al. (2019) | R/RStudio | Ensemble algorithm | Middle school | Structures and properties | Argumentation | CR | 246/775/763 |
| Dood et al. (2020a, 2020b) | SPSS modeler text analytics/Python | Category-based lexical analysis | Post-secondary | Nucleophilic substitution | Explanation | CR | 1041 |
| Noyes et al. (2020) | R/RStudio | Ensemble algorithm | Undergraduate | Intermolecular forces | Explanation | CR | 1730 |
| Maestrales et al. (2021) | AACR web portal | Ensemble algorithm | High school | Structures and properties | Explanation | CR | 26 800 |
| Winograd et al. (2021b) | Python | BERT/convolutional neural network | Undergraduate | Chemical equilibrium | Communication | WTL | 297 |
| Yik et al. (2021) | R/RStudio | Support vector machine | Post-secondary | Acid–base chemistry | Explanation | CR | 852 |
| Watts et al. (2023) | Python | Convolutional neural network | Undergraduate | Racemization, acid hydrolysis, Wittig reaction | Explanation | WTL | 771 |
| Study | Pedagogical purpose | Rubric design | Construct assessment | Validation approaches | Prompt structure | Sample heterogeneity |
|---|---|---|---|---|---|---|

(Per-study level assignments for the 20 reviewed studies, shown graphically in the original table, are not reproduced here.)
Dood et al. (2020a, 2020b) developed an adaptive online tutorial to improve reasoning about nucleophilic substitution reactions. Before working with this tutorial, students’ explanations of a unimolecular nucleophilic substitution were automatically scored. Depending on the determined level of explanation sophistication, students were assigned one of two adapted tutorials that addressed leaving group departure, carbocation stability, nucleophilic addition, and acid–base proton transfer. Dood et al. (2020a, 2020b) found that completing the adapted tutorial significantly improved students’ reasoning skills. Their work illustrates how ML can be used to implement aspects of adaptive learning (level 3).
Also marked as contemporary effective practices in the category pedagogical purpose, Donnelly et al. (2015), Vitale et al. (2016), and Tansomboon et al. (2017) used ML techniques to investigate the effectiveness of automated, adaptive guidance in readily accessible online curricula units on thermodynamics and climate change. Donnelly et al. (2015) assigned middle school students to either a revisit condition, where they were prompted to review a dynamic visualization, or a critique condition, where they had to criticize a sample explanation. With automated scoring, they found that combining adapted guidance with either the revisit or critique condition helped students significantly in acquiring knowledge and revising their explanations. Here, both conditions were more effective for low-performing students, indicating the importance of adapted guidance. Building upon this work, Tansomboon et al. (2017) found that transparent automated guidance, which explicitly clarified how the ML output was generated, supported low-performing students in revising an explanation better than typical automated guidance. Similarly, Vitale et al. (2016) found that content guidance, which specified missing ideas in students’ responses, resulted in instant learning gains within a simulation, but knowledge integration guidance, which delivered tailored hints, was better suited to enable knowledge transfer. Together, these studies indicate that ML output can serve as a basis for investigating further qualitative and quantitative research questions in science education.
According to these contemporary effective practices, the next steps in ML-based assessment are to explore the benefits of ML models in daily teaching practices, shifting ML-influenced research from replicating and validating human scoring to guiding students in adaptively acquiring reasoning skills. Comprehensively discussing the pedagogical purpose of ML application before designing and using an assessment helps ensure that we take advantage of ML's comprehensive pedagogical benefits.
To elicit three-dimensional learning (National Research Council, 2012), Maestrales et al. (2021) created a holistic, three-level rubric classifying responses as incorrect, correct, and multi-dimensional correct. By listing the explanatory components expected in each dimension, Maestrales et al. (2021) not only showed what an expert answer should look like but also created a valid foundation for human scoring. Liu et al. (2014), Donnelly et al. (2015), Vitale et al. (2016), and Tansomboon et al. (2017) developed holistic, multi-level knowledge integration rubrics to assess student reasoning in computer-guided inquiry learning (Linn et al., 2014). By distinguishing between off-task responses, scientifically non-normative ideas, unelaborated links, full links, and complex links, these rubrics valued students’ skills in generating ideas, applying interdisciplinary concepts, and gathering evidence (Linn and Eylon, 2011). Similarly, Dood et al. (2020a) and Noyes et al. (2020) generated holistic, three-level rubrics to categorize student thinking about substitution reactions and London dispersion forces, respectively. Dood et al. (2020a) distinguished between responses that described what happens in a reaction, why the reaction occurs at a surface level, and why the reaction occurs at a deeper level. Noyes et al. (2020) differentiated between non-electrostatic, electrostatic causal, and causal mechanistic explanations.
Just like Dood et al. (2020a) and Noyes et al. (2020), Watts et al. (2023) examined fine-grained reasoning features, aligned with Russ et al.'s (2008) framework for mechanistic reasoning, to automatically analyze the how and why of mechanisms. In contrast to the studies mentioned above, Watts et al. (2023) applied an analytic coding approach. With convolutional neural networks, they identified the presence of different features of mechanistic reasoning, e.g., identifying activities. Ultimately, their convolutional neural network provided a means to distinguish between explicit and implicit as well as static and dynamic explanations. Winograd et al. (2021b) and Watts et al. (2023) share the fine-grained analytic coding approach; however, Winograd et al.'s (2021b) approach differs in that four dimensions of cognitive operations (Grimberg and Hand, 2009), namely consequences, cause and effect, deduction, and argumentation, were evaluated. Winograd et al. (2021b) argue that causal reasoning includes consequences, deduction, and cause and effect; subsequently, the identification of these three operations may help detect instances of causal reasoning.
Haudek et al. (2019) combined analytic and holistic coding approaches when assessing argumentation skills of middle school students. For human coding, they applied a dichotomous analytic rubric that outlined, for instance, the presence of one claim, one piece of evidence, and three pieces of reasoning. For computer coding, the analytic components were combined into a holistic score indicating whether students formed incomplete, partially complete, or complete arguments about sugar dissolving in water.
Prevost et al. (2012b) and Haudek et al. (2015) chose a completely different approach to design rubrics for reaction spontaneity and acid–base behavior of functional groups, respectively. Based on human- and computer-set categories, both applied k-means clustering, an unsupervised ML technique, to group explanations into mutually exclusive clusters. After clustering, they investigated the key aspects of each cluster by analyzing sample responses to inductively develop rubrics. Their process of inductive rubric development has two benefits: as an unsupervised technique, the application of k-means clustering does not require human coding, which saves time and costs. Additionally, rubrics can be built exploratively based on student responses, allowing for more extensive analyses of students’ mechanistic reasoning and for adjusting clusters easily over time.
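The sketch below illustrates the general idea of clustering responses as a starting point for inductive rubric development; the responses are invented, and Prevost et al. (2012b) and Haudek et al. (2015) clustered lexical categories rather than raw TF-IDF vectors, so this is an analogous, simplified workflow.

```python
# Minimal sketch: unsupervised k-means clustering of responses as a starting
# point for inductive rubric development. (Invented responses; simplified.)
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "the reaction is spontaneous because entropy increases",
    "entropy goes up so it happens on its own",
    "it is spontaneous because the instructor said so",
    "energy is released which makes it favorable",
    "the products are lower in energy so it is favorable",
    "it just happens",
]

X = TfidfVectorizer(stop_words="english").fit_transform(responses)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Researchers then read sample responses per cluster to name the emerging
# ideas (e.g., entropy-based vs. energy-based reasoning) and build a rubric.
for cluster, response in sorted(zip(clusters, responses)):
    print(cluster, response)
```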
Overall, most studies in our sample constructed ML algorithms that could accurately evaluate different levels of sophistication, already illustrating a high standard of granularity in rubric design. The high accuracy of the ML-based evaluation in these studies indicates that ML has the potential to adopt fine-grained rubrics for detailed analyses. A prerequisite, however, is that the defined levels are well-distinguishable for humans so that the algorithm can consistently decipher the underlying patterns. In the future, it might be necessary to assess not only the sophistication level but also the correctness of applied concepts for summative purposes as well as students’ mixed ideas. Applying unsupervised ML more extensively additionally helps detect patterns across all responses so that rubrics can be adjusted inductively.
Different from these research projects, two studies considered contextualized features (level 3) of written responses more comprehensively. Winograd et al. (2021b) applied pre-trained BERT models to characterize the syntactic dependencies of conceptual components according to different cognitive operations (Grimberg and Hand, 2009) while students reasoned about ocean acidification. Their method can be seen as a starting point for fostering complex reasoning in essays, providing feedback for learners and instructors about applied cognitive operations, and uncovering unproductive learning processes. With a similar goal, Watts et al. (2023) applied convolutional neural networks, a cutting-edge deep learning algorithm, to evaluate how students reason about entities, properties of entities, and activities.
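For illustration, the sketch below trains a small convolutional neural network to detect a single, hypothetical analytic feature ("describes an activity") in short responses; the architecture, data, and labels are our own simplifications and not the models reported by Watts et al. (2023).

```python
# Minimal sketch: a tiny convolutional neural network for detecting one
# analytic feature in short responses. (Invented data and architecture.)
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

texts = tf.constant([
    "the nucleophile attacks the electrophilic carbon",
    "the lone pair pushes onto the carbonyl carbon",
    "the molecule carries a positive charge",
    "there is a double bond in the product",
])
labels = np.array([1, 1, 0, 0])  # 1 = activity described, 0 = not

# Turn raw text into fixed-length sequences of token ids.
vectorize = layers.TextVectorization(output_sequence_length=12)
vectorize.adapt(texts)
X = vectorize(texts)

# Convolution filters slide over word embeddings and respond to short,
# meaningful phrases; max-pooling keeps the strongest response per filter.
model = tf.keras.Sequential([
    layers.Embedding(input_dim=vectorize.vocabulary_size(), output_dim=16),
    layers.Conv1D(filters=8, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=20, verbose=0)

print(model.predict(vectorize(tf.constant(["the base removes a proton"]))))
```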
Collectively, the studies show that selecting a method for natural language processing as well as an ML technique is a critical step in designing any assessment. The construct assessment method must be chosen so that detailed diagnostic information can be derived from the ML output, potentially enabling individualized feedback and adapted guidance. Due to the enormous technological progress, cutting-edge algorithms can now be employed without programming from scratch. Compared to the use of ML in earlier years, this has opened up the possibility of extensively evaluating conceptual and semantic features in text responses.
Maestrales et al. (2021) found that responses including more advanced vocabulary like density or velocity were more likely to be misclassified. The algorithm's difficulties in dealing with advanced vocabulary could be explained by the underrepresentation of these technical terms in the training set. Since students did not frequently apply specific technical terms, the algorithm could not score them correctly. Yik et al. (2021) also investigated the shortcomings of their generalized model for evaluating Lewis acid–base model use. They examined three reasons for false positive and two reasons for false negative predictions, i.e., cases in which the algorithm was wrong. For instance, responses that included the term electrons without referring to the Lewis acid–base theory led to false positives. Yik et al.'s (2021) identification of misclassifications can inform future research to anticipate the limitations of ML models and to adjust rubric, coding process, and choice of algorithm accordingly.
Some misclassifications can be avoided by using clustering techniques to inductively group students’ responses into mutually exclusive clusters. As outlined in the section Rubric design in ML-detected reasoning, Prevost et al. (2012b) and Haudek et al. (2015) used such an approach, namely k-means clustering, to exploratively group students’ responses based on their conceptual ideas. In this context, they compared the clusters they expected before ML analysis with those identified by the ML algorithm. This human interpretation in conjunction with ML-based analysis finally helped them derive clusters that comprehensively reflected student reasoning.
Haudek et al. (2012), in turn, investigated whether machine-scored written responses reflected students’ mental models by conducting interviews. They found that there was no significant difference in the number of acid–base-related categories, e.g., strong base, raise pH, and ionization, identified by an ML algorithm and used in oral interviews. With these categories, Haudek et al. (2012) subsequently classified responses as incorrect, partially correct, and correct and found that partially correct responses were the most difficult to score accurately. The lack of accuracy was explained by the vagueness of what partially correct answers contained, as human raters interpreted the partially correct level of student sophistication differently, thus leading to validity bias. To be noted, Donnelly et al. (2015) conducted interviews just like Haudek et al. (2012); however, the purpose of these interviews was not to analyze ML model decisions, which is why this study was not assigned to the third level of the ML-adapted ECD in validation approaches.
Together, the studies indicate that the validity features of ML-based formative assessment already seem to be well-researched. In almost every study, human scoring could be replicated with high accuracy, which led to major advancements in automating and scaling formative assessments. Moreover, several sources of misclassification were identified as well. Future research can build upon these insights to increase the machine-human score agreement of their assessments.
Dood et al. (2018, 2020a), Noyes et al. (2020), Yik et al. (2021), and Watts et al. (2023) developed their research instrument based on the findings of Cooper et al. (2016): in Dood et al. (2018, 2020a) and Yik et al. (2021), students first had to describe what happens on a molecular level for a specific acid–base reaction or a unimolecular nucleophilic substitution, respectively. After that, students had to explain why the reaction occurs. Similarly, Noyes et al. (2020) investigated learners’ explanations of London dispersion forces by asking what happens with the electrons during the attraction process and why the atoms attract. With a focus on the how and why, Watts et al. (2023) elicited students’ content knowledge of reaction mechanisms in a writing-to-learn activity by prompting them to describe and explain racemization and acid hydrolysis of a thalidomide molecule as well as a base-free Wittig reaction. In another writing-to-learn activity, Winograd et al. (2021b) asked students to respond to a fake social media prompt that addressed chemical equilibrium and Le Châtelier's principle in the authentic context of ocean acidification. By scaffolding the problem-solving process with structured questions (level 3), these studies used appropriate prompts for an ML analysis of student reasoning.
Taken together, mechanistic reasoning, with or without ML techniques, can only be assessed if prompts elicit the desired reasoning skills. Recent studies in chemistry education research have shown the effectiveness of scaffolded prompts (Cooper et al., 2016; Caspari and Graulich, 2019; Graulich and Caspari, 2020; Watts et al., 2021; Kranz et al., 2023). According to these findings, ML-related studies should pay attention to the structure and wording of prompts to collect evidence of mechanistic reasoning.
Overall, the reviewed studies highlight that there is a risk of producing biased ML models. When the data sample is too homogenous or when bias is introduced in the way prompts are structured, rubrics are designed, and responses are coded, ML models will replicate this bias. The development of automated scoring models goes, according to Cheuk (2021), hand in hand with questions of equity. Future studies should consider how bias is introduced into their ML model, how the parameters of an algorithm must be set to reduce bias, and which differential effects occur.
Fig. 4 Next level for implementing evidence-centered design in machine learning-based science assessment.
Beyond research focusing on delivering automated guidance directly to students, there is a need for research on teacher dashboards. Showing ML output to provide instructors with insights into the ways students respond can be useful, as it helps streamline the evaluation process. Furthermore, dashboards allow instructors to tailor direct guidance and provide individualized feedback to students (Urban-Lurain et al., 2013).
However, most rubrics have been built deductively, which means that researchers set frameworks first and analyze data according to this framework afterward. Unsupervised pattern recognition, though, holds great potential for designing rubrics inductively. By applying a three-step process called computational grounded theory (Nelson, 2020), unsupervised ML can be integrated with domain-specific human expertise to raise the reliability and validity of rubrics: in a first step, unsupervised algorithms can detect patterns in data exploratively to gain breadth and quantity. After that, humans can interpret these emerging patterns to add depth and quality. In the end, algorithms can check if the qualitative human coding represents the structure of the data and if it can be applied reliably. Unsupervised ML and humans, thus, form a symbiosis in data interpretation.
For example, Rosenberg and Krist (2021) applied computational grounded theory to develop a nuanced rubric for evaluating the epistemic characteristics of the generality of model-based explanations. They combined a two-step clustering approach with qualitative, interpretative coding to derive a fine-grained rubric inductively from students’ responses. The combination of unsupervised ML and qualitative coding finally revealed the complexity of students' conceptions. In the future, this approach may help develop computerized instructional systems that cover the whole range of student sophistication, e.g., in terms of mechanistic reasoning.
Moreover, n-grams and term frequency weighting allow only for identifying and counting sequences of n adjacent words, so that a response is represented merely by a vector of counts, a so-called bag of words. Given that the meaning of a word significantly depends on its context, semantic, logical, and technical relationships between phrases may only be analyzed superficially with this technique. n-Grams and related approaches cannot capture contextual embeddings, cross-references, and implicit statements. In the future, cutting-edge technologies like BERT may better reveal contextual embeddings in written responses.
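A short sketch makes this limitation concrete: two responses with opposite meanings receive identical bag-of-words vectors because word order is discarded (invented example).

```python
# Minimal sketch: bag-of-words representations discard word order.
from sklearn.feature_extraction.text import CountVectorizer

a = "the acid donates a proton to the base"
b = "the base donates a proton to the acid"  # opposite meaning

X = CountVectorizer().fit_transform([a, b]).toarray()
print((X[0] == X[1]).all())  # True: the two vectors are indistinguishable
```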
Furthermore, the definition of constructs often relies on written features when capturing mechanistic reasoning, resulting in constructs that are solely based on students’ learning products (Mislevy, 2016; Zhai et al., 2020a). However, ML algorithms can include other types of data in their predictive model, e.g., log data (e.g., Zhu et al., 2017; Lee et al., 2021). Gathering process data with high-tech devices or sensory information may also facilitate the definition of new constructs. For example, Ghali et al. (2016) developed the educational game LewiSpace to teach drawing Lewis structures. They collected data about emotional and cognitive states when predicting whether a learner will succeed or fail at a level of the game. Primarily, they used three types of sensors, namely electroencephalography, eye tracking, and facial expression recognition with an optical camera, to predict learners’ success rates. Integrating knowledge from neuroscience, psychology, and computer science, such as behavioral, psychological, and brain wave data (e.g., Wehbe et al., 2014; Mason and Just, 2016), may ultimately help incorporate states of emotion, cognition, and affect into science assessment (Kubsch et al., 2022a).
Synthesizing the reviewed studies to answer the other two guiding questions revealed that applying ML has advanced the analysis of mechanistic reasoning in some categories of the ML-adapted ECD. The tenets of the evidence space are comprehensively addressed by current research studies. Due to the design of multi-level rubrics capturing different sophistication levels, evidence rules that determine how to analyze features in written responses are characterized in detail. Additionally, the technological progress of recent years has facilitated the construction of measurement models that assess conceptual and semantic features, as shown by Winograd et al. (2021b) and Watts et al. (2023). Moreover, validation approaches are extensively reported in the reviewed studies. Following this, future research should use fine-grained rubrics in conjunction with cutting-edge technologies to continue these contemporary effective practices. Regarding the task space, the analysis is more differentiated, as both advancements and shortcomings can be identified. Developing scaffolded prompts helped elicit evidence of mechanistic reasoning, as shown by many studies. However, only Noyes et al. (2020) investigated differential effects in their sample, which leads us to the conclusion that low sample heterogeneity is a major limitation of many studies. Further shortcomings can be identified in the claim space, since in more than half of the selected studies the pedagogical purpose of ML application currently does not go beyond automating the scoring of students’ mechanistic reasoning. That is, many studies did not envisage beforehand how to embed ML analysis optimally in a learning environment. On this point, research is needed that investigates students’ and instructors’ expectations of ML output in automated science assessment.
The identified advancements and shortcomings of our analysis are in alignment with other reviews and perspectives (Zhai et al., 2020a, 2020c; Kubsch et al., 2023). In their dimension construct, Zhai et al. (2020a) found that 81% of the reviewed studies tapped complex constructs, indicating the great potential of ML in assessing such constructs. Since mechanistic reasoning is a complex construct per se (cf., Bachtiar et al., 2022), our review confirms Zhai et al.'s (2020a) finding. However, in many studies of our sample, there is no evidence that the investigated construct could solely be analyzed by ML. Instead, the primary intention of most reviewed studies was to replicate human scoring and deliver conventional science assessments faster. Hence, ML's potential in deepening human evaluation, analyzing more nuanced cognitive processes, and dealing with big data has not yet been fully utilized (Zhai et al., 2020c).
For the automaticity of an assessment, we have, just like Zhai et al. (2020a), found that most studies advanced the application of straightforward supervised techniques to automate, replicate, and scale human scoring of students’ responses from valid and reliable assessment instruments, while only a few employed unsupervised techniques to reveal hidden patterns. This finding indicates that the automaticity of scoring complex constructs with supervised techniques is already well-researched. However, as shown by the studies we reviewed, many formative assessments are only automated for a rather homogeneous sample because most studies collected data at a single institution.
Furthermore, many studies in our sample provide limited improvement to assessment functionality. Although technological advances have made it increasingly possible to capture conceptual as well as semantic features in text responses, the pedagogical purpose of applying ML is usually not envisioned a priori. This seems to be a common issue in current ML studies, as Zhai et al. (2020a, 2020c) found as well that not even half of the reviewed studies reported how to embed their assessment in a learning environment. In the future, more studies are needed to investigate the opportunities of ML for individually and adaptively supporting students’ reasoning. Integrating high-quality learning assistance into high-functional ML-based instructional settings ultimately allows for adaptively supporting students in acquiring scientific competence.
In sum, ML techniques hold great potential in supporting humans in data analysis. Additionally, further data sources can be combined to extend the validity of conclusions (Kubsch et al., 2021). To date, research has mostly focused on assessment automation, which only changes the quantity of research efforts, but not their quality (Nelson, 2020; Kubsch et al., 2023). Hence, we only get more of the same insights as in traditional assessments. Focusing on automation leaves many ML methods unexplored, which is why the quality of evidence from data does not change. In the future, it is, for instance, necessary to integrate computational techniques with interpretative human coding to raise the validity of both and gain new insights into cognitive processes (Sherin, 2013; Nelson, 2020; Nelson et al., 2021; Kubsch et al., 2023).
Along with other researchers (Zhai et al., 2020a, 2020c; Kubsch et al., 2023), we suggest exploring the transformative character of ML in science education beyond automating formative assessment. This includes moving away from perceiving humans as mere consumers of algorithmic output that takes over certain decision-making processes, since such output rarely answers research questions related to the qualitative analysis of cognitive operations (Kubsch et al., 2023). In other words, ML algorithms do not replace human qualitative interpretation; thoughtful human interpretation is more crucial than ever.