When a machine detects student reasoning: a review of machine learning-based formative assessment of mechanistic reasoning

Paul P. Martin and Nicole Graulich *
Justus-Liebig-University Giessen, Institute of Chemistry Education, Heinrich-Buff-Ring 17, 35392 Giessen, Germany. E-mail: Nicole.Graulich@didaktik.chemie.uni-giessen.de

Received 21st October 2022, Accepted 28th January 2023

First published on 30th January 2023


Abstract

In chemistry, reasoning about the underlying mechanisms of observed phenomena lies at the core of scientific practices. The process of uncovering, analyzing, and interpreting mechanisms for explanations and predictions requires a specific kind of reasoning: mechanistic reasoning. Several frameworks have already been developed that capture the aspects of mechanistic reasoning to support its formative assessment. However, evaluating mechanistic reasoning in students’ open responses is a time- and resource-intensive, complex, and challenging task when performed by hand. Emerging technologies like machine learning (ML) can automate and advance the formative assessment of mechanistic reasoning. Due to its usefulness, ML has already been applied to assess mechanistic reasoning in several research projects. This review focuses on 20 studies that apply ML in chemistry education research to capture mechanistic reasoning. We developed a six-category framework based on the evidence-centered design (ECD) approach to evaluate these studies in terms of pedagogical purpose, rubric design, construct assessment, validation approaches, prompt structure, and sample heterogeneity. Contemporary effective practices of ML-based formative assessment of mechanistic reasoning in chemistry education are emphasized to guide future projects and to help overcome existing challenges. Ultimately, we conclude that ML has advanced replicating, automating, and scaling human scoring, while it has not yet transformed the quality of evidence drawn from formative assessments.


Introduction

Across all sciences, mechanistic reasoning is a critical thinking skill (Machamer et al., 2000; Glennan, 2002; Russ et al., 2008; Bolger et al., 2012; Illari and Williamson, 2012; Southard et al., 2016; van Mil et al., 2016). However, students experience significant challenges in generating mechanistic explanations in science subjects (cf., Bachtiar et al., 2022). Research in chemistry education has likewise documented these challenges in reasoning about chemical mechanisms in organic (cf., Graulich, 2015; Dood and Watts, 2022a, 2022b) or general chemistry (cf., Talanquer, 2009; Sevian and Talanquer, 2014), since many cognitive operations must co-occur. For example, when solving mechanistic problems, students must reason about submicroscopic processes, think of multiple influences, and weigh different variables (Machamer et al., 2000; Russ et al., 2008). Due to the importance of all those skills for a profound chemical understanding, a major goal of chemistry education is supporting students in constructing causal, mechanistic explanations (Cooper, 2015). Consequently, teaching needs to provide opportunities for students to reason about how and why mechanisms occur to emphasize the importance of a process-oriented understanding (Cooper et al., 2016).

However, students will only construct such explanations and develop the desired, cognitively demanding reasoning skills if these skills are regularly assessed (Stowe and Cooper, 2017; Stowe et al., 2021; DeGlopper et al., 2022). Traditional chemistry assessment tends to emphasize rote learning strategies, such as recall, application of simple algorithms, and pattern recognition (Stowe and Cooper, 2017). For this reason, students may rely on memorization in chemistry classes (Grove and Lowery Bretz, 2012). Accordingly, instructors need tools for formative assessment that capture the quality of mechanistic reasoning. Such tools can be assessment items that allow students to explain how and why mechanisms occur (Cooper et al., 2016) so that instructors can regularly assess and foster the desired reasoning skills.

Developing high-quality science assessments that elicit an extensive set of competencies as well as content knowledge and give adequate feedback is challenging in daily teaching practices (Songer and Ruiz-Primo, 2012; Pellegrino, 2013; Pellegrino et al., 2016). Creating valid and reliable tools for drawing evidence-based inferences about mechanistic reasoning and individual support needs is particularly challenging. So far, closed-response questions, for example, single- or multiple-choice items, have often been the standard type of assessment in large classrooms. These items can be evaluated quickly, but the forced choice comes with validity threats (Birenbaum and Tatsuoka, 1987; Kuechler and Simkin, 2010; Lee et al., 2011). Students may just guess the correct answer, preventing future learning progression. Additionally, closed-response questions cannot assess students’ normative scientific and naïve ideas as accurately as human- or computer-scored open-ended items (Beggrow et al., 2014).

Hence, science education research (e.g., Haudek et al., 2019) calls for the application of open-ended assessments such as constructed responses, essays, simulations, educational games, and interdisciplinary assessments since these formats may promote detailed explanations of chemical phenomena. Until now, resource constraints like workload, costs, and time for human evaluation have often prevented teachers and faculty members from using such formative assessments regularly in large-enrollment courses. Machine learning (ML) can be used to increase the application of formative open-ended assessments (cf., Zhai et al., 2020a, 2020c) because it offers new approaches for capturing student understanding, facilitating immediate decision-making as well as action-taking (Zhai, 2021). Notably, ML eases human effort in scoring, improves approaches to evidence collection, and has great potential in tapping manifold constructs (Zhai, 2019), which enables more reasoning-focused assessment and teaching (for more information on ML terms, see the glossary in Appendix).

The primary goal of this review is to illustrate the advancements and shortcomings that ML has brought to the formative assessment of mechanistic reasoning. For that purpose, we discuss six interdependent categories: pedagogical purpose of ML application, rubric design, construct assessment, validation approaches, prompt structure, and sample heterogeneity. We emphasize the importance of these categories for designing and implementing formative assessments and provide an in-depth analysis of how mechanistic reasoning is captured in ML-based chemistry assessments. Three questions guide the objectives of this review: First, what contemporary effective practices can be deduced from current research projects? Second, how did the selected research projects already advance the field of ML-based formative chemistry assessment capturing mechanistic reasoning? Third, what are the shortcomings in applying ML to formative chemistry assessment capturing mechanistic reasoning? After analyzing the selected studies, we suggest implications for the future implementation of ML in formative assessments to extend assessment functionality.

Theoretical considerations

Evidence-centered design approach

The discussion of ML-based formative assessment is, in this review, anchored in the evidence-centered design (ECD) approach to educational assessment (e.g., Mislevy et al., 2003a, 2003b; Mislevy, 2006; Mislevy and Haertel, 2007; Rupp et al., 2012; Riconscente et al., 2015; Pellegrino et al., 2016; Kubsch et al., 2022b). ECD offers a conceptual framework for assessment practices in which evidence is gathered and interpreted in coherence with the intended purpose of the assessment (Mislevy et al., 2003a). Following this, ECD considers assessments as a theory-guided process of evidentiary reasoning aiming to draw coherent conclusions about students’ competencies from the things they say, write, or make (Mislevy et al., 2003a; Mislevy and Haertel, 2007; Mislevy, 2016; Pellegrino et al., 2016). Drawing such evidentiary conclusions helps bring the expectations about students’ domain-specific competencies in line with the observations needed to evidence them (Glaser et al., 1987; Mislevy et al., 2003a).

Influenced by these considerations, three highly interrelated spaces are highlighted in the ECD (cf., Fig. 1): the claim, evidence, and task space (Pellegrino et al., 2016; Kubsch et al., 2022b). The claim space addresses the objectives of an assessment, i.e., the competencies that should be evaluated through the assessment (Messick, 1994). Consequently, before designing an assessment, one needs to analyze the targeted cognitive domain to make a precise claim about students’ competencies within this domain (Rupp et al., 2012; Mislevy, 2016; Pellegrino et al., 2016; Kubsch et al., 2022b). Making a claim includes unpacking the fine-grained, domain-specific characteristics, analyzing their connections, specifying how they contribute to the acquisition of essential competencies, and arranging all competencies in an order of increasing complexity (Pellegrino et al., 2016). In the context of mechanistic reasoning, several frameworks have been developed that characterize the specificities of this complex type of reasoning (Machamer et al., 2000; Russ et al., 2008; Kraft et al., 2010; Sevian and Talanquer, 2014; Cooper et al., 2016; Caspari et al., 2018a; Krist et al., 2019; Dood et al., 2020a; Raker et al., 2023; Yik et al., 2023). Although all these frameworks focus on the dynamic interplay of entities and activities at a scalar level below the phenomenon, they concentrate on different conceptualizations and elaborateness of mechanistic reasoning (cf., Dood and Watts, 2022a). Most frameworks differentiate between forms of reasoning that capture, among others, descriptive, causal, and mechanistic reasoning (Machamer et al., 2000; Russ et al., 2008; Cooper et al., 2016; Caspari et al., 2018a). Descriptive reasoning entails identifying the entities of a mechanism as well as their explicit properties without necessarily outlining the cause for the underlying processes. Such reasoning is more focused on describing the system as a whole (Sevian and Talanquer, 2014; Dood et al., 2020a; Raker et al., 2023; Yik et al., 2023). Causal reasoning involves understanding cause-and-effect relationships between different variables and why they contribute to the outcome of a chemical phenomenon. Causal reasoning is, therefore, concerned with understanding principles or factors that are driving a reaction, rather than the specific steps or processes involved (diSessa, 1993; Carey, 1995; Russ et al., 2008; Cooper et al., 2016). Mechanistic reasoning focuses on understanding how a chemical reaction proceeds at the molecular level, including the mechanistic steps that convert the reactants into the products. So, mechanistic reasoning aims to explain, at the electronic level, how underlying processes occur (Machamer et al., 2000; Russ et al., 2008; Cooper et al., 2016; Caspari et al., 2018a, 2018b; Krist et al., 2019).


Fig. 1 Simplified representation of the evidence-centered design (ECD) approach, including the claim, evidence, and task space, the guiding questions central to each space, as well as the six machine learning-related categories derived from ECD.

After reflecting on the claim space, evidence statements that outline the performance accepted as evidence must be composed (Mislevy, 2016; Pellegrino et al., 2016; Kubsch et al., 2022b). Formulating evidence statements involves specifying the desired competencies in detail and setting a framework for the interpretation of this evidence (Kubsch et al., 2022b). Eliciting evidence of mechanistic reasoning is a challenging, iterative procedure that requires multiple refinements of the task model and a repeated evaluation of students’ responses (Cooper et al., 2016; Noyes et al., 2022). In particular, a coding scheme should be developed based on expected and actual answers to outline what a high level of reasoning sophistication could look like (Cooper, 2015). Here, evidence rules (Mislevy et al., 2003a; Rupp et al., 2012) define how observable aspects, for instance, written phrases in students’ work products about a mechanism's plausibility, contribute to students’ competence in mechanistic reasoning. Measurement models, in turn, specify the method for making diagnostic inferences about the examined construct (Rupp et al., 2012), e.g., how phrases in students’ written accounts are coded as evidence of mechanistic reasoning.

Coherently defining evidence rules and measurement models helps develop a construct-centered task model that characterizes the form of tasks needed to collect evidence about students’ competencies (Pellegrino et al., 2016). Rather than representing a single task, task models characterize the central features that a set of potential tasks must possess as well as the elements that could be varied across different assessment contexts (Mislevy et al., 2003a; Mislevy and Haertel, 2007; Kubsch et al., 2022b). Besides defining these contexts, a task model for capturing mechanistic reasoning must balance initiating as much sophisticated reasoning as possible against providing too much information on the structure and content of a high-quality response (Noyes et al., 2022). Structuring prompts adequately and specifying the degree of scaffolding is, thus, key in developing task models (Caspari and Graulich, 2019; Graulich and Caspari, 2020; Noyes et al., 2022) that, for instance, guide students in judging the plausibility of a reaction mechanism (Lieber and Graulich, 2020, 2022; Lieber et al., 2022a, 2022b; Watts et al., 2022). Consistently connecting claim, evidence, and task space ultimately enables the design and use of formative assessments to be better aligned with the construct of interest.

Compared to previous methods, ML has great potential in automatically evaluating mechanistic reasoning since these algorithms can learn from previous experience without explicit programming (Samuel, 1959). For detecting mechanistic reasoning with supervised techniques, human raters have to consistently label evidence of mechanistic reasoning in students’ responses according to a literature-based or inductively derived coding scheme (Raker et al., 2023). Afterward, the selected algorithm can be trained with the labeled data so that it can decipher the underlying patterns. With this approach, it is not necessary to explicitly program an algorithm that, for example, distinguishes between a chain of phenomenological links and mechanistic reasoning. The level of causality, i.e., different types of reasoning, can simply be determined by training the algorithm with coherently labeled data and letting the algorithm learn from these patterns (Dood et al., 2018, 2020a; Noyes et al., 2020). In doing so, even subtleties in mechanistic reasoning can be identified if heterogeneous and coherently labeled training data is available (Watts et al., 2023).
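
To make this workflow concrete, the following minimal sketch (in Python, using scikit-learn) shows how a supervised classifier could be trained on human-coded responses and then applied to an unseen response. The example responses, reasoning levels, and model choice are illustrative assumptions, not taken from any of the reviewed studies.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical human-coded training data (2 = mechanistic, 1 = causal, 0 = descriptive)
responses = [
    "The bromide leaves and the electrons shift onto the carbocation, which the nucleophile then attacks.",
    "The reaction happens because the leaving group is good.",
    "The product of the reaction is an alcohol.",
]
labels = [2, 1, 0]

# Train a text classifier to reproduce the human codes
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word unigrams and bigrams as features
    LogisticRegression(max_iter=1000),
)
model.fit(responses, labels)

# Score an unseen response
print(model.predict(["Electrons move from the nucleophile to the electrophilic carbon."]))
```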

ML-related categories derived from ECD

Aligned with the claim, evidence, and task space, six categories can be defined which are present in many studies – not only in those which deal with the formative assessment of mechanistic reasoning (cf., Fig. 1): pedagogical purpose of ML application, rubric design, construct assessment, validation approaches, prompt structure, and sample heterogeneity. The pedagogical purpose of ML application articulates the pedagogical benefits of using ML in science assessment. This category can be located in the claim space because one must first ask for which pedagogical purpose an assessment is to be used. The categories rubric design, construct assessment, and validation approaches are consistent with the evidence space since they correspond to the evidence rules and measurement models outlined above. More specifically, rubric design refers to the definition of the fine-grained characteristics accepted as evidence. Construct assessment describes the algorithmic model that translates the input data into diagnostic details about students’ competencies, which means that this category builds the bridge between the technological implementation and diagnostic granularity of ML-based formative assessment. Moreover, validation approaches characterize the method used for validating the reliability of an algorithm. Last, prompt structure and sample heterogeneity are assigned to the task space. Prompt structure specifies the tasks that elicit the desired evidence about students’ competencies. Sample heterogeneity refers to the sample responding to this task; here, a heterogeneous sample is a prerequisite for building equitable ML models. Considering the six categories in the development of formative assessments helps translate cognitive theories into assessments that yield evidence concerning the construct of interest (Pellegrino et al., 2016). The ECD approach adapted for ML-based formative assessment (called ML-adapted ECD hereafter) will serve as a theoretical underpinning for the evaluation of the selected studies in this review.

Another framework conceptualizing the implementation of ML in science assessment in general, developed by Zhai et al. (2020a), considers the dimensions of construct, automaticity, and functionality. Some cross-references between the ML-adapted ECD and Zhai et al.'s (2020a) framework can be made (cf., Fig. 2). Zhai et al.'s (2020a) dimension construct is covered by our categories rubric design and prompt structure. For automaticity, Zhai et al. (2020a) introduced, among others, the variables machine extracts attributes and model generalization. Machine extracts attributes is reflected in our category construct assessment because both deal with the fundamental methodology of an ML algorithm. Zhai et al.'s (2020a) model generalization analyzes whether an ML algorithm trained with one dataset can be applied to another dataset. Building upon this distinction, we used the category sample heterogeneity. For functionality, Zhai et al. (2020a) established, among others, the variables score use and measurement models. The variable score use evaluates whether the ML output is embedded in a learning activity, which we highlight in pedagogical purpose of ML application. Last, Zhai et al.'s (2020a) variable measurement models corresponds to our category validation approaches since both refer to the method used for the validation of an algorithm.


Fig. 2 Mapping the six categories of the machine learning-adapted evidence-centered design to the three dimensions of Zhai et al.'s (2020a) framework.

Multi-hierarchical ML workflow for capturing students’ mechanistic reasoning

Based on the ML-adapted ECD and the six categories identified in the claim, evidence, and task space (cf., Fig. 1), we further divided each category into three hierarchical levels to describe the extent to which the selected studies applied the ECD principles (cf., Fig. 3). These hierarchical levels may help put ML-adapted ECD into practice and characterize the advancements and shortcomings that ML has brought to the formative assessment of mechanistic reasoning. In addition, they may serve as a tool to evaluate ongoing research processes. We illustrate these categories and levels first with research approaches in other domains that can also be applied to chemistry assessment to show the generalizability and topicality of the ML-adapted ECD principles in science assessment.
Fig. 3 Representation of the workflow for capturing complex constructs with machine learning (ML) based on the evidence-centered design (ECD) approach. Three hierarchical levels for the degree of implementing the ECD principles when applying ML are proposed in each category.

Claim space

The claim space comprises the category pedagogical purpose of ML application that covers primarily how to integrate ML into learning environments. The pedagogical purpose sets the basis for applying ML in educational assessments since it specifies the objectives of the respective assessment.
Pedagogical purpose of ML application. A key component of applying ML in science education is clarifying the purpose of the designed assessment, which determines the pedagogical benefits ML can offer. Applying ML for the automated analysis of responses (level 1) helps evaluate students’ proficiency in real time. This ML application can be considered the first level since the embedding of ML in learning activities (level 2) is linked to higher achievement in science assessment (cf., Zhai et al., 2020c). Easily accessible embeddings eventually provide an opportunity for ML-guided adaptive learning (level 3) through personalized feedback or tailored exercises (Kerr, 2016).

To exemplify what we mean by a well-planned pedagogical purpose of ML application in terms of adaptive learning, we delineate the studies of Zhu et al. (2017), Mao et al. (2018), and Lee et al. (2021). The authors investigated students’ scientific argumentation skills with a focus on the linkage between claim and evidence, the limitations of the given data, and the factors that cause uncertainty. In doing so, they developed an ML-based instructional system that automatically provided individualized real-time feedback so that students had the chance to revise their arguments accordingly. Zhu et al. (2017) found that most students revised their arguments about factors that affect climate change after receiving adapted feedback. Consequently, students who revised their arguments had significantly higher final scores than those who did not adjust their arguments.

Lee et al. (2021) built upon the work of Zhu et al. (2017) and Mao et al. (2018) by applying ML in three simulation-based scientific argumentation tasks. In addition to argumentation feedback, Lee et al. (2021) provided ML-enabled simulation feedback to improve students’ interactions with simulations about groundwater systems. Students who received adapted simulation feedback were more likely to re-run the simulation, leading to significantly higher scores in some tasks. This result shows that ML algorithms can provide effective individualized feedback not only for written arguments but also for simulation interactions. So, ML can be used to ease the implementation of adaptive learning.

Evidence space

The evidence space contains the three categories rubric design, construct assessment, and validation approaches dealing with the characteristics accepted as evidence for a construct, the way these characteristics are analyzed, and the methods used for the validation of the analysis. The evidence space connects the claim and task space which ensures a valid evaluation of the investigated construct.
Rubric design. In any type of assessment, validity is a prerequisite. Notably, a computer algorithm can only capture reasoning validly if humans can. So, valid rubrics need to be designed based on students’ responses or the literature. Analytic rubrics measure the presence or absence of multiple binary conceptual components. Here, every analytic bin outlines a single idea, and every response must be coded for each of the non-mutually exclusive bins. For a more comprehensive evaluation of student reasoning, holistic rubrics can be applied since they measure the characteristics of students’ responses by assigning a single, mutually exclusive score. Analytic and holistic rubrics are interrelated. On the one hand, a holistic rubric can be derived by analyzing the combinations of the analytic bins (Noyes et al., 2022). On the other hand, holistic rubrics can be deconstructed into analytic bins (Jescovitch et al., 2019b). Integrating both types eventually has the advantage that students’ responses can be evaluated at an overarching and a fine-grained level.
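
As a simple illustration of the difference between the two rubric types, the following sketch represents one hypothetical response coded against three invented analytic bins and shows one possible way a holistic level could be derived from their combination; the bins and the mapping rule are our own assumptions, not those of any reviewed study.
```python
# One hypothetical response coded against three invented analytic bins
analytic_bins = {
    "names_entities": 1,         # non-mutually exclusive, binary codes
    "describes_activity": 1,
    "links_cause_to_effect": 0,
}

# One possible rule for deriving a single, mutually exclusive holistic level
holistic_level = sum(analytic_bins.values())   # 0-3; higher = more complete reasoning
print(holistic_level)                          # prints 2
```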

In their studies on scientific argumentation, Jescovitch et al. (2021) and Wang et al. (2021) compared a multi-level, holistic coding approach with a dichotomous, analytic coding approach. Both found that ML models trained with analytic bins performed slightly better than those trained with a holistic rubric, especially for items that introduced a complex scenario. They inferred that analytic rubrics might have the potential to identify and unpack additional complexity in students’ responses, an advantage that can be explained by the reduced human coding effort, which increases human–human inter-rater reliability and improves model performance (Jescovitch et al., 2019a, 2021). However, there is no evidence that one coding approach outperforms the other in every context for human and machine coding (Zhai et al., 2020c). For this reason, more important than the rubric type is that rubrics allow for consistent classifications of data, both within and between human coders (Raker et al., 2023).

For the application of ML, fine-grained rubrics seem to offer deeper insights into student reasoning. Therefore, we classify binary rubrics as level 1, multi-level rubrics as level 2, and multi-level rubrics representing different levels of reasoning sophistication as level 3 in the ML-adapted ECD. Rubrics classified as level 2, in general, differentiate between incorrect, partially correct, and correct responses, whereas rubrics classified as level 3 distinguish between different levels of reasoning sophistication. In all cases, a valid rubric and an iterative process of its application and refinement are necessary to elicit the construct of interest (Allen and Tanner, 2006).

Construct assessment. Applying ML in science assessment offers a great opportunity to automatically evaluate complex constructs and detect hidden patterns. Before designing an assessment, it is necessary to decide whether supervised or unsupervised techniques should be applied to analyze the dataset (cf., Appendix). The magnitude of available data, the explored phenomena, and the generated hypotheses guide the ML model selection (Mjolsness and Decoste, 2001). In general, no algorithm outperforms any other in every context (Zhai et al., 2020c).

For example, lexical analysis software can automatically extract domain-specific categories that are iteratively refined by human experts afterward (Haudek et al., 2011). Defining meaningful categories can, consequently, be a major task in applying lexical analysis (Urban-Lurain et al., 2010). Categories should offer enough information to detect different sophistication levels; however, infrequently used categories have less predictive power. With statistical methods, it is, thus, possible to identify meaningful categories and collapse non-discriminatory ones. Sometimes, categories are formed based on human- or computer-set if-then commands, which come with limitations. On the one hand, case-specific custom libraries need to be defined to decide which keywords are included in the if-then commands and how fine-grained the categories should be, which is labor- and time-intensive. On the other hand, if-then commands do not allow for flexible adjustments of the categories over time. So, we define construct assessment based on if-then commands as the first level in this category.
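
A minimal sketch of such keyword-based, if-then categorization is given below; the categories, keywords, and example response are hypothetical and only meant to illustrate the general approach, not the custom libraries used in the reviewed studies.
```python
# Hypothetical custom library: each category is triggered by an if-then keyword check
CUSTOM_LIBRARY = {
    "accept_protons": ["accepts a proton", "proton acceptor", "accepts h+"],
    "sharing_electrons": ["shares electrons", "donates electrons", "electron pair donation"],
}

def categorize(response: str) -> list[str]:
    """Return every category whose keyword list matches the (lower-cased) response."""
    text = response.lower()
    return [category for category, keywords in CUSTOM_LIBRARY.items()
            if any(keyword in text for keyword in keywords)]

print(categorize("The base accepts a proton while the acid shares electrons with it."))
# -> ['accept_protons', 'sharing_electrons']
```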

Rather than just running human- or computer-defined commands, ML models can also gain experience on their own (Bishop, 2006). Once the algorithm has learned from the training data, it can classify or predict the characteristics of new data (cf., Appendix). For such ML algorithms, features can, for instance, be extracted with n-grams. These n-grams are sequences of n consecutive words that allow for identifying repetitive patterns in students’ responses. Besides word n-grams, ML algorithms can also process character n-grams, response length, syntactic dependencies, and semantic role labels (e.g., Mao et al., 2018). These methods require less human effort, especially when a reliable scoring rubric has already been applied to many student responses (Haudek et al., 2011). However, the mentioned natural language processing techniques are built on the simplified assumption that word order is irrelevant to the meaning of a sentence (Wulff et al., 2022b), which complicates the detection of implicit semantic embeddings. So, traditional ML models are only sensitive to key conceptual components, which is why we define construct assessment based on shallow learning experiences as the second level of the ML-adapted ECD. Here, the term experience refers to the algorithm's process of collecting information from pre-processed data to make automated decisions.
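
The following sketch (scikit-learn) illustrates such shallow feature extraction: responses are converted into counts of word unigrams and bigrams, discarding word order beyond the n-gram window. The example responses are invented.
```python
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "the leaving group departs and a carbocation forms",
    "the nucleophile attacks the carbocation",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))   # word unigrams and bigrams
X = vectorizer.fit_transform(responses)

print(vectorizer.get_feature_names_out()[:5])   # first few n-gram features
print(X.toarray().shape)                        # (2 responses, number of n-gram features)
```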

Other techniques like Bidirectional Encoder Representations from Transformers (BERT), a natural language processing technique for calculating embedding vectors, may assess the relationships between words more comprehensively since BERT relies on contextualized embeddings (Wulff et al., 2022a). In other words, BERT is capable of grasping the implicit meaning of individual words, understands the dynamic contextualized relationships of a word with any other word in a sentence, and recognizes filler words. Since BERT processes all words of a sentence simultaneously, it can detect both conceptual and semantic features, which is why we consider construct assessment based on contextualized learning experiences as the third level in this category.
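
A minimal sketch of obtaining contextualized embeddings with a pre-trained BERT model via the Hugging Face transformers library is shown below; the sentences are invented, and training a downstream classifier on top of these embeddings is omitted.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The base attacks the proton.",           # 'base' as a chemical species
    "The tutorial covers the base case.",     # 'base' in an everyday sense
]

with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors="pt")
    outputs = model(**encoded)

# One contextualized vector per token per sentence; the same word receives
# different vectors in different contexts, and a classifier can be trained on top.
print(outputs.last_hidden_state.shape)   # torch.Size([2, max_tokens, 768])
```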

Validation approaches. The primary metric for the accuracy of an ML model is the agreement between human and machine scores (Williamson et al., 2012). For a high machine-human score agreement, a high human–human inter-rater reliability in the coding process is necessary. Since supervised models are built on human codes, cases in which humans did not reach a consensus may also induce scoring problems for the algorithm. Hence, it seems helpful to feed only data for which consensus was reached into the algorithm.

Three methods can be applied to calculate the machine-human score agreement: self-validation, split-validation, and cross-validation (cf., Appendix). Nehm et al. (2012) found that their machine-human score agreement slightly decreased when validating their algorithms with a new testing set compared to the agreement reached when using the same dataset for model training and validation. In their systematic review, Zhai et al. (2020c) corroborated this result, finding that cross-validation yields better machine-human score agreements than self- and split-validation. However, using self-validation raises equity and generalizability concerns since the same data is used for the training and evaluation of the algorithm (Zhai et al., 2020a, 2020c). For this reason, we classify self- and cross-validation approaches as the first and split-validation approaches as the second level in the ML-adapted ECD. For educational purposes, the transparency of ML-based decisions should, additionally, not be underestimated (Cheuk, 2021). Besides calculating the level of agreement, it is helpful to understand why the respective model decisions were made by comparing human- and computer-set clusters, interpreting misclassifications, or conducting interviews. Analyzing ML model decisions is, therefore, the third level in this category.
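
The sketch below contrasts split-validation (a held-out test set) with k-fold cross-validation for estimating machine-human score agreement; the tiny inline dataset, the binary codes, and the pipeline are illustrative assumptions only.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical human-coded responses (1 = mechanistic, 0 = descriptive)
texts = [
    "electrons flow from the nucleophile to the carbon",
    "the leaving group departs taking the bonding pair",
    "a carbocation forms because the bond breaks heterolytically",
    "the lone pair attacks the electrophilic carbon",
    "the product is an alcohol",
    "the reaction is fast",
    "bromide is replaced by hydroxide",
    "the mixture turns cloudy",
]
human_scores = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Split-validation: train on one part of the data, evaluate on held-out responses
X_train, X_test, y_train, y_test = train_test_split(
    texts, human_scores, test_size=0.25, random_state=0, stratify=human_scores)
model.fit(X_train, y_train)
print("hold-out agreement:", model.score(X_test, y_test))

# Cross-validation: repeat the train/evaluate split across k folds
print("4-fold agreement:", cross_val_score(model, texts, human_scores, cv=4).mean())
```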

Irrespective of the validation approach, various metrics are used to measure the magnitude of machine-human score agreement (cf., Zhai et al., 2020c). For a more consistent indication of machine-human score agreement, Zhai et al. (2021) call for the introduction of standardized rules, e.g., to report a confusion matrix, accuracy, Cohen's κ, and the relevant variance.
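
Reporting several of these metrics at once is straightforward; the sketch below computes a confusion matrix, accuracy, and Cohen's κ for hypothetical human and machine codes of the same set of responses.
```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

y_human   = [0, 1, 2, 2, 1, 0, 2, 1]   # hypothetical human codes
y_machine = [0, 1, 2, 1, 1, 0, 2, 2]   # hypothetical machine codes for the same responses

print(confusion_matrix(y_human, y_machine))
print("accuracy:", accuracy_score(y_human, y_machine))
print("Cohen's kappa:", cohen_kappa_score(y_human, y_machine))
```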

Task space

The task space consists of two categories prompt structure and sample heterogeneity defining the task model used for capturing evidence of mechanistic reasoning as well as the sample responding to that task. The considerations of the task space help practically implement science assessments.
Prompt structure. When developing a prompt, a precise wording should clearly communicate the concepts that sophisticated reasoning needs to address. Comparing the students’ ideas elicited in this way with an expected answer helps adjust the prompt iteratively. When modifying a prompt, one has to keep in mind that small changes in the wording can drastically change the way students respond (Noyes et al., 2022). A prompt has to offer enough information for students to work on a task. If a prompt is too vague, students may not be prompted to reason in a sophisticated way. In contrast, an overly specific prompt may give students detailed information on what the answer requires. This may lead to a repetition of key terms, phrases, and ideas denoted in the question stem. Preparing a well-elaborated prompt is, therefore, crucial to ensure that the assessment captures students’ understanding, not the repetition of buzzwords.

During the prompt refinement process, one has to consider not only the wording but also the structure of a prompt in terms of scaffolding (e.g., Kang et al., 2014). Scaffolding, for example, combined with contrasting cases (Graulich and Schween, 2018), helps focus students’ attention on productive concepts (Wood et al., 1976) and activate resources that might otherwise be overlooked (Hammer, 2000). When adjusting the scaffolding, one has to consider the resources that students have already applied in their reasoning, the parts of the task model that have activated those resources, and the modifications that can be implemented to activate further appropriate resources (Noyes et al., 2022). In ML-based science assessment, scaffolding can also be realized by providing multiple response boxes in which students can enter their answers (Urban-Lurain et al., 2010). This creation of separate text boxes has two major benefits. On the one hand, multiple boxes provide students with an easily accessible form of scaffolding since these boxes may help students structure their explanations as desired. On the other hand, multiple response boxes facilitate automated text analysis as shown by Urban-Lurain et al. (2010). In their study, a single text box was provided to investigate student thinking on acid–base chemistry. They realized that the students used similar terms in this box but with entirely different ideas in mind. Due to the similar wording, ML models could neither elicit the context to which the technical terms referred nor students’ cognitive concepts. The introduction of text boxes for different prompts helped them to overcome this limitation.

Designing a prompt to capture mechanistic reasoning is, considering all these aspects, a complex process requiring a detailed analysis of the phenomenon under investigation and of the competencies that students should acquire. Simply asking students to explain how and why phenomena occur, without providing comprehensive support, is often not sufficient to initiate sophisticated reasoning (Cooper, 2015). Scaffolded prompts can better guide students in activating their resources as the expectations of what a good answer should look like are more clearly outlined and assessable afterward. Based on the known effectiveness of scaffolded prompts, we define the levels unspecific (level 1), specific (level 2), and specific, scaffolded (level 3) in the category prompt structure. By unspecific prompts, we mean questions that can be asked regardless of the subject domain, such as “Explain your reasoning”. Specific prompts provide better orientation for students on how to solve a task. Due to their potential instructional support on how to structure a response, specific, scaffolded prompts comprise the third level in this category.

Sample heterogeneity. Because language background influences how mechanistic reasoning is expressed (Deng et al., 2022), the training of an ML model ideally requires a heterogeneous data sample. Responses from students with different language backgrounds should be fed into the algorithm to accurately assess various reasoning types. Following this, the heterogeneity of answers is key to developing ML models that can be applied in different learning environments, independent of institution, course, curriculum, and question prompt.

Some studies on ML in science education already emphasized the importance of creating equitable ML models. Ha et al. (2011) investigated whether an ML model developed using biology students’ written explanations of evolutionary change from one institution could score responses from another institution. They found that their predictive model could accurately score responses at both universities for most key concepts; however, the model performance slightly decreased for a few concepts at the second university. The partly lower accuracy of the ML model was explained by unique language patterns, low concept frequencies, misspellings, and informal vocabulary. Liu et al. (2016) analyzed differential effects concerning gender, native language, and use of a computer for homework and found no generalizable differences between human and machine scoring across all items for any of these variables. Similar to Ha et al. (2011), Liu et al. (2016) traced minor differences in human and machine scoring back to misspellings and linguistic diversity. These findings led Ha and Nehm (2016) to examine the influence of misspelled words on machine scoring. Although students who learned English as an additional language produced significantly more misspellings, this did not meaningfully impact computer scoring efficiency. For a valid examination of these differential effects, a heterogeneous data sample is required (Cheuk, 2021).

Putting it all together, considerable heterogeneity of responses is needed to investigate the generalizability of an ML model, which means that data must cover all sophistication levels. If specific words, terms, or phrases are missing from the training data, the algorithm cannot predict them in the testing data. This prerequisite is reflected in the ML-adapted ECD (cf., Fig. 3). A low degree of heterogeneity (level 1) may produce a biased ML model, since students who meet specific external characteristics may be favored, whereas students who do not meet these characteristics may be disadvantaged. By a low degree of heterogeneity, we mean that certain external characteristics, such as demographic background, instructor, and curriculum, tend to be identical across the investigated sample. A high degree of heterogeneity (level 2) enlarges the training vocabulary of the ML model and increases equity by considering different demographic backgrounds, instructors, and curricula. Investigating differential effects (level 3) finally helps uncover and prevent bias.
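
A simple way to probe such differential effects, sketched below, is to compute machine-human agreement separately for subgroups of interest; the codes and the language-background attribute are invented for illustration only.
```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical (human code, machine code, language background) triples
records = [
    (2, 2, "English L1"), (1, 1, "English L1"), (0, 1, "English L1"),
    (2, 1, "English L2"), (1, 1, "English L2"), (0, 0, "English L2"),
]

# Agreement is reported per subgroup to look for systematic scoring bias
for group in ("English L1", "English L2"):
    human   = [h for h, m, g in records if g == group]
    machine = [m for h, m, g in records if g == group]
    print(group, "kappa:", cohen_kappa_score(human, machine))
```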

Selection of studies

This review investigates the state of the art in the area of ML-detected mechanistic reasoning. We aim to provide an in-depth analysis of the prospects and obstacles that ML techniques have brought to the evaluation of this special type of reasoning, rather than to give a broad overview of applying ML in chemistry or science education. Please see Zhai et al. (2020a, 2020c) for a framework or a review, respectively, on applying ML in science assessment, Deeva et al. (2021) for a review on automated feedback systems, and Gerard et al. (2015) for a meta-analysis on the effectiveness of automated, adaptive guidance.

We defined a set of keywords to look for studies that contributed to the automated analysis of mechanistic reasoning in chemistry. After an iterative process of refinement, the search term chemistry AND (“machine learning” OR “lexical analysis” OR “automated guidance” OR “automated feedback” OR “automated text analysis”) was established. We then used the three databases Google Scholar, ProQuest, and ERIC to find suitable articles. Due to the immense number of returned results (44 600 in total), we restricted where the search terms must occur. We specified that the keywords must occur in the title when using Google Scholar, in the abstract when using ProQuest, and anywhere in the article when using ERIC. Eventually, the search process resulted in 996 articles (333 Google Scholar, 640 ProQuest, 23 ERIC). First, we screened the titles of these papers to sort out all articles related to the field of chemistry but not to chemistry education. After that, we looked closely at the abstracts to examine whether the articles contributed to our area of interest. Additionally, we defined inclusion and exclusion criteria that the reviewed articles must comply with.

The included studies had to meet the following criteria:

• Topic of research: research studies must focus on ML's application in chemistry education research. The participants in the studies did not have to be chemistry students exclusively.

• Goal of research: research studies must contribute to the analysis of mechanistic reasoning. In this review, mechanistic reasoning generally refers to reasoning about the underlying mechanisms of observed phenomena, i.e., general or organic chemistry phenomena. For a study to be included, it was not sufficient that chemistry content or reasoning about macroscopic phenomena was addressed. Included studies must capture reasoning about underlying mechanisms, i.e., how and why phenomena occur.

• Time range of publications: since the first paper meeting the defined selection criteria was published in 2009, we considered all articles since 2009.

The following criteria led to the exclusion of a study:

• Studies concentrating on science assessments with predominantly non-chemical content are excluded (e.g., Liu et al., 2016; Zhai et al., 2022a).

• Studies analyzing inquiry or modeling skills in drawing activities are excluded (e.g., Rafferty et al., 2013, 2014; Gerard et al., 2016; Zhai et al., 2022b).

• Studies focusing on the automated analysis of inquiry skills in designing and conducting experiments are excluded (e.g., Gobert et al., 2013, 2015; Sao Pedro et al., 2013).

• Studies using ML to evaluate student-written interpretations of chemical representations are excluded (e.g., Prevost et al., 2014).

• Studies investigating the automated analysis of student-constructed graphs are excluded (e.g., Vitale et al., 2015).

• Studies using written responses to explore methodological issues of ML, e.g., the effect of construct complexity on scoring accuracy (Haudek and Zhai, 2021), are excluded.

• Studies discussing the teacher role in ML application or their pedagogical content knowledge assessed by ML are excluded (e.g., Zhai et al., 2020b).

We screened the reference section of each selected study to find other appropriate papers. The final set of articles contains 20 studies (cf., Table 1). To be noted, Urban-Lurain et al. (2009) summarized parts of Haudek et al.'s (2009) study. Prevost et al. (2012a) document the use of an automatically evaluated item, which is presented more extensively by Prevost et al. (2012b). Dood et al. (2018) report the development of a predictive model for Lewis acid–base concept use, whereas Dood et al. (2019) focus on the pedagogical benefits of their model. Similarly, the pedagogical use of Dood et al.'s (2020a) computerized scoring model is reported in the study by Dood et al. (2020b). In this review, we refer, where possible, to Haudek et al. (2009), Prevost et al. (2012b), Dood et al. (2018), and Dood et al. (2020a).

Table 1 Overview of the selected research articles
Author | Software | Method | Education level | Domain | Science practice | Item | Sample sizes
Note: Haudek et al. (2009) and Urban-Lurain et al. (2009), Prevost et al. (2012a, 2012b), Dood et al. (2018, 2019), and Dood et al. (2020a, 2020b) refer to the same predictive model, which is why they are represented within a single row. BERT = Bidirectional Encoder Representations from Transformers, CR = constructed response, SI = simulation, FD = flow diagram, WTL = writing-to-learn.
Haudek et al. (2009), Urban-Lurain et al. (2009) | SPSS Text Analytics for Surveys | Category-based lexical analysis | Undergraduate | Thermodynamics, acid–base chemistry | Explanation | CR | 158/153/382
Haudek et al. (2012) | SPSS Text Analytics for Surveys | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 1172/323
Prevost et al. (2012a, 2012b) | SPSS Text Analytics for Surveys/SPSS Modeler Text Analytics | Category-based lexical analysis | Undergraduate | Thermodynamics | Explanation | CR | 168/329/386
Liu et al. (2014) | c-Rater | Natural language processing | Middle school | Thermodynamics | Explanation | CR | 412/362/321/356
Donnelly et al. (2015) | c-Rater | Natural language processing | Middle school | Thermodynamics | Inquiry | CR, SI | 346
Haudek et al. (2015) | SPSS Modeler Text Analytics | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 336
Vitale et al. (2016) | c-Rater-ML | Support vector regressor | Middle school | Climate change | Inquiry | CR, FD, SI | 283
Tansomboon et al. (2017) | c-Rater-ML | Support vector regressor | Middle school | Thermodynamics | Inquiry | CR, SI | 482
Dood et al. (2018, 2019) | SPSS Modeler Text Analytics | Category-based lexical analysis | Undergraduate | Acid–base chemistry | Explanation | CR | 752/1058
Haudek et al. (2019) | R/RStudio | Ensemble algorithm | Middle school | Structures and properties | Argumentation | CR | 246/775/763
Dood et al. (2020a, 2020b) | SPSS Modeler Text Analytics/Python | Category-based lexical analysis | Post-secondary | Nucleophilic substitution | Explanation | CR | 1041
Noyes et al. (2020) | R/RStudio | Ensemble algorithm | Undergraduate | Intermolecular forces | Explanation | CR | 1730
Maestrales et al. (2021) | AACR web portal | Ensemble algorithm | High school | Structures and properties | Explanation | CR | 26 800
Winograd et al. (2021b) | Python | BERT/convolutional neural network | Undergraduate | Chemical equilibrium | Communication | WTL | 297
Yik et al. (2021) | R/RStudio | Support vector machine | Post-secondary | Acid–base chemistry | Explanation | CR | 852
Watts et al. (2023) | Python | Convolutional neural network | Undergraduate | Racemization, acid hydrolysis, Wittig reaction | Explanation | WTL | 771


Discussion of the selected studies on ML-based formative assessment of mechanistic reasoning

Based on the defined levels, we categorized the 20 selected articles. First, author PPM assigned each article in each of the six categories to one of three hierarchical levels. After discussing the categorization with author NG, we refined it until a total consensus was reached (cf., Table 2). Based on this categorization, we illustrate some contemporary effective practices of ML-based formative assessment of mechanistic reasoning, i.e., studies that were assigned to the third level of the ML-adapted ECD in the respective category (cf., Fig. 3). After that, we deduce trends as well as advancements and shortcomings that ML has brought to the formative assessment of mechanistic reasoning.
Table 2 Levels of the selected studies according to our framework. The left-aligned, orange dots indicate level 1, the centered, blue dots indicate level 2, and the right-aligned, green dots indicate level 3


Pedagogical purpose of ML-detected reasoning

The pedagogical purpose of most studies focused on automating and scaling the analysis of constructed responses in large classes. Thirteen of the reviewed studies (cf., Table 2) set automating human scoring (level 1) as their pedagogical purpose. Here, these studies used the ML output to determine the number of responses in each rubric category, which helped predict students’ competence. However, these studies concentrated less on investigating the pedagogical benefits of ML and more on the technical and validity features of their assessment.

Dood et al. (2020a, 2020b) developed an adaptive online tutorial to improve reasoning about nucleophilic substitution reactions. Before working with this tutorial, students’ explanations of a unimolecular nucleophilic substitution were automatically scored. Depending on the determined level of explanation sophistication, students were assigned one of two adapted tutorials that addressed leaving group departure, carbocation stability, nucleophilic addition, and acid–base proton transfer. Dood et al. (2020a, 2020b) found that completing the adapted tutorial significantly improved students’ reasoning skills. Their work illustrates how ML can be used to implement aspects of adaptive learning (level 3).

Also marked as contemporary effective practices in the category pedagogical purpose, Donnelly et al. (2015), Vitale et al. (2016), and Tansomboon et al. (2017) used ML techniques to investigate the effectiveness of automated, adaptive guidance in readily accessible online curriculum units on thermodynamics and climate change. Donnelly et al. (2015) assigned middle school students to either a revisit condition, where they were prompted to review a dynamic visualization, or a critique condition, where they had to criticize a sample explanation. With automated scoring, they found that combining adapted guidance with either the revisit or critique condition helped students significantly in acquiring knowledge and revising their explanations. Here, both conditions were more effective for low-performing students, indicating the importance of adapted guidance. Building upon this work, Tansomboon et al. (2017) found that transparent automated guidance, which explicitly clarified how the ML output was generated, supported low-performing students in revising an explanation better than typical automated guidance. Similarly, Vitale et al. (2016) found that content guidance, which specified missing ideas in students’ responses, resulted in instant learning gains within a simulation, but knowledge integration guidance, which delivered tailored hints, was better suited to enable knowledge transfer. Together, these studies indicate that ML output can serve as a basis for investigating further qualitative and quantitative research questions in science education.

Building on these contemporary effective practices, the next steps in ML-based assessment are to explore the benefits of ML models in daily teaching practices, shifting ML-influenced research from replicating and validating human scoring to guiding students in adaptively acquiring reasoning skills. Comprehensively discussing the pedagogical purpose of ML application before designing and using an assessment helps ensure that ML's full pedagogical benefits are realized.

Rubric design in ML-detected reasoning

When looking at the selected studies in the category rubric design, it is apparent that most of them (cf., Table 2) already used multi-level rubrics representing different levels of reasoning sophistication (level 3). Only three studies used binary rubrics (level 1) to evaluate whether a student applied the Lewis acid–base concept, while three others designed multi-level rubrics (level 2) that distinguished between incorrect, partially correct, and correct responses, also with a focus on acid–base reactions.

To elicit three-dimensional learning (National Research Council, 2012), Maestrales et al. (2021) created a holistic, three-level rubric classifying responses as incorrect, correct, and multi-dimensional correct. By listing the explanatory components expected in each dimension, Maestrales et al. (2021) not only showed what an expert answer should look like but also created a valid foundation for human scoring. Liu et al. (2014), Donnelly et al. (2015), Vitale et al. (2016), and Tansomboon et al. (2017) developed holistic, multi-level knowledge integration rubrics to assess student reasoning in computer-guided inquiry learning (Linn et al., 2014). By distinguishing between off-task responses, scientifically non-normative ideas, unelaborated links, full links, and complex links, these rubrics valued students’ skills in generating ideas, applying interdisciplinary concepts, and gathering evidence (Linn and Eylon, 2011). Similarly, Dood et al. (2020a) and Noyes et al. (2020) generated holistic, three-level rubrics to categorize student thinking about substitution reactions and London dispersion forces, respectively. Dood et al. (2020a) distinguished between responses that described what happens in a reaction, explained why the reaction occurs at a surface level, and explained why the reaction occurs at a deeper level. Noyes et al. (2020) differentiated between non-electrostatic, electrostatic causal, and causal mechanistic explanations.

Just like Dood et al. (2020a) and Noyes et al. (2020), Watts et al. (2023) analyzed fine-grained reasoning features aligned with Russ et al.'s (2008) framework for mechanistic reasoning to automatically capture the how and why of mechanisms. In contrast to the studies mentioned above, Watts et al. (2023) applied an analytic coding approach. With convolutional neural networks, they identified the presence of different features of mechanistic reasoning, e.g., identifying activities. Ultimately, their convolutional neural network provided a means to distinguish between explicit and implicit as well as static and dynamic explanations. Winograd et al. (2021b) and Watts et al. (2023) share the fine-grained analytic coding approach; however, Winograd et al.'s (2021b) approach differs in that four dimensions of cognitive operations (Grimberg and Hand, 2009), namely consequences, cause and effect, deduction, and argumentation, were evaluated. Winograd et al. (2021b) argue that causal reasoning includes consequences, deduction, and cause and effect; consequently, the identification of these three operations may help detect instances of causal reasoning.

Haudek et al. (2019) combined analytic and holistic coding approaches when assessing argumentation skills of middle school students. For human coding, they applied a dichotomous analytic rubric that outlined, for instance, the presence of one claim, one piece of evidence, and three pieces of reasoning. For computer coding, the analytic components were combined into a holistic score indicating whether students formed incomplete, partially complete, or complete arguments about sugar dissolving in water.

Prevost et al. (2012b) and Haudek et al. (2015) chose a completely different approach to design rubrics for reaction spontaneity and acid–base behavior of functional groups, respectively. Based on human- and computer-set categories, both applied k-means clustering, an unsupervised ML technique, to group explanations into mutually exclusive clusters. After clustering, they investigated the key aspects of each cluster by analyzing sample responses to inductively develop rubrics. Their process of inductive rubric development has two benefits: as an unsupervised technique, the application of k-means clustering does not require human coding, which saves time and costs. Additionally, rubrics can be built exploratively based on student responses, allowing for more extensive analyses of students’ mechanistic reasoning and for adjusting clusters easily over time.
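
A minimal sketch of this kind of unsupervised grouping is given below (scikit-learn k-means on TF-IDF vectors); the responses and the number of clusters are illustrative, and in practice researchers would read responses from each cluster to name the underlying ideas and derive rubric categories from them.
```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "the reaction is spontaneous because the entropy of the system increases",
    "entropy increases so delta G becomes negative and the process is spontaneous",
    "the reaction just happens quickly",
    "it happens fast because of the heat",
]

# Vectorize the responses and group them into two mutually exclusive clusters
X = TfidfVectorizer().fit_transform(responses)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Inspect sample responses per cluster to develop rubric categories inductively
for cluster, response in zip(clusters, responses):
    print(cluster, response)
```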

Overall, most studies in our sample constructed ML algorithms that could accurately evaluate different levels of sophistication, already illustrating a high standard of granularity in rubric design. The high accuracy of the ML-based evaluation in these studies indicates that ML has the potential to adopt fine-grained rubrics for detailed analyses. A prerequisite, however, is that the defined levels are well-distinguishable for humans so that the algorithm can consistently decipher the underlying patterns. In the future, it might be necessary to assess not only the sophistication level but also the correctness of applied concepts for summative purposes as well as students’ mixed ideas. Applying unsupervised ML more extensively additionally helps detect patterns across all responses so that rubrics can be adjusted inductively.

Construct assessment in ML-detected reasoning

To evaluate mechanistic reasoning, different methods have been used so far (cf., Table 2). In the earlier years of ML application in chemistry assessment, studies used feature extraction based on human- and computer-defined if-then commands (level 1), leveraging both supervised and unsupervised techniques. In particular, these studies set up domain-specific categories, e.g., accept protons or sharing electrons (Dood et al., 2018), to classify the use of essential keywords. The technological shortcomings of these studies can be explained by the limited possibilities at that time: the authors had to resort to basic ML techniques since applying more sophisticated algorithms was far more demanding than it is today. By contrast, six more recent studies let algorithms learn from shallow experiences (level 2).

Different from these research projects, two studies considered contextualized features (level 3) of written responses more comprehensively. Winograd et al. (2021b) applied pre-trained BERT models to characterize the syntactic dependencies of conceptual components according to different cognitive operations (Grimberg and Hand, 2009) while students reasoned about ocean acidification. Their method can be seen as a starting point for fostering complex reasoning in essays, providing feedback for learners and instructors about applied cognitive operations, and uncovering unproductive learning processes. With a similar goal, Watts et al. (2023) applied convolutional neural networks, a cutting-edge deep learning algorithm, to evaluate how students reason about entities, properties of entities, and activities.

Collectively, the studies show that selecting a method for natural language processing as well as an ML technique is a critical step in designing any assessment. The construct assessment method must be chosen so that detailed diagnostic information can be derived from the ML output, potentially enabling individualized feedback and adapted guidance. Due to the enormous technological progress, cutting-edge algorithms can now be employed without programming from scratch. Compared to the use of ML in earlier years, this has opened up the possibility of an extensive evaluation of conceptual and semantic features in text responses.

Validation approaches in ML-detected reasoning

In our sample (cf., Table 2), six studies relied on cross-validation or did not report their validation approach (level 1), while eight additionally used split-validation (level 2). Six other studies analyzed ML model decisions (level 3) by discussing misclassifications, comparing human- and computer-set clusters, or conducting face-to-face interviews.

Maestrales et al. (2021) found that responses including more advanced vocabulary like density or velocity were more likely to be misclassified. The algorithm's difficulties in dealing with advanced vocabulary could be explained by the underrepresentation of these technical terms in the training set. Since students did not frequently apply specific technical terms, the algorithm could not score them correctly. Yik et al. (2021) also investigated the shortcomings of their generalized model for evaluating Lewis acid–base model use. They identified three reasons for false positive and two reasons for false negative predictions, i.e., cases in which the algorithm was wrong. For instance, responses that included the term electrons without referring to the Lewis acid–base theory led to false positives. Yik et al.'s (2021) identification of misclassifications can inform future research to anticipate the limitations of ML models and to adjust rubrics, coding processes, and the choice of algorithm accordingly.

Some misclassifications can be avoided by using clustering techniques to inductively group students’ responses into mutually exclusive clusters. As outlined in the section Rubric design in ML-detected reasoning, Prevost et al. (2012b) and Haudek et al. (2015) used such an approach, namely k-means clustering, to exploratively group students’ responses based on their conceptual ideas. In this context, they compared the clusters they expected before ML analysis with those identified by the ML algorithm. This human interpretation in conjunction with ML-based analysis finally helped them derive clusters that comprehensively reflected student reasoning.

Haudek et al. (2012), in turn, investigated whether machine-scored written responses reflected students’ mental models by conducting interviews. They found no significant difference between the number of acid–base-related categories, e.g., strong base, raise pH, and ionization, identified by an ML algorithm and the number used in oral interviews. With these categories, Haudek et al. (2012) subsequently classified responses as incorrect, partially correct, and correct and found that partially correct responses were the most difficult to score accurately. The lack of accuracy was explained by the vagueness of what partially correct answers contained, as human raters interpreted the partially correct level of student sophistication differently, which introduced validity bias. Of note, Donnelly et al. (2015) conducted interviews just like Haudek et al. (2012); however, the purpose of these interviews was not to analyze ML model decisions, which is why this study was not assigned to the third level of the ML-adapted ECD in validation approaches.

Together, the studies indicate that the validity features of ML-based formative assessment are already well-researched. In almost every study, human scoring could be replicated with high accuracy, which led to major advancements in automating and scaling formative assessments. Moreover, several sources of misclassification have been identified as well. Future research can build upon these insights to increase the machine-human score agreement of their assessments.

Prompt structure in ML-detected reasoning

When the possibility of applying ML in chemistry assessment emerged, many methodological considerations were initially necessary to design ML-based assessments. This has led to a research gap in integrating technological findings with aspects of mechanistic reasoning in chemistry education research. In some of the earlier studies in our sample (cf., Table 2), rather unspecific prompts (level 1) were developed, for example, “Explain your answer for the above question” (Haudek et al., 2009, p. 4; Haudek et al., 2012, p. 284; Haudek et al., 2015, p. 7), “How can you tell?” (Prevost et al., 2012a, p. 3; Prevost et al., 2012b, p. 2), “Explain your reasoning for your choice above”, and “Why do you think this is so?” (Prevost et al., 2012b, p. 4, 5). In their limitations section, Prevost et al. (2012b) acknowledged that more specific prompts would probably have motivated students to apply their chemical content knowledge more extensively. Reaching the next level, five other studies developed more specific prompts (level 2) to elicit mechanistic reasoning.

Dood et al. (2018, 2020a), Noyes et al. (2020), Yik et al. (2021), and Watts et al. (2023) developed their research instrument based on the findings of Cooper et al. (2016): in Dood et al. (2018, 2020a) and Yik et al. (2021), students first had to describe what happens on a molecular level for a specific acid–base reaction or a unimolecular nucleophilic substitution, respectively. After that, students had to explain why the reaction occurs. Similarly, Noyes et al. (2020) investigated learners’ explanations of London dispersion forces by asking what happens with the electrons during the attraction process and why the atoms attract. With a focus on the how and why, Watts et al. (2023) elicited students’ content knowledge of reaction mechanisms in a writing-to-learn activity by prompting them to describe and explain racemization and acid hydrolysis of a thalidomide molecule as well as a base-free Wittig reaction. In another writing-to-learn activity, Winograd et al. (2021b) asked students to respond to a fake social media prompt that addressed chemical equilibrium and Le Châtelier's principle in the authentic context of ocean acidification. By scaffolding the problem-solving process with structured questions (level 3), these studies used appropriate prompts for an ML analysis of student reasoning.

Taken together, mechanistic reasoning, with or without ML techniques, can only be assessed if prompts elicit the desired reasoning skills. Recent studies in chemistry education research have shown the effectiveness of scaffolded prompts (Cooper et al., 2016; Caspari and Graulich, 2019; Graulich and Caspari, 2020; Watts et al., 2021; Kranz et al., 2023). According to these findings, ML-related studies should pay attention to the structure and wording of prompts to collect evidence of mechanistic reasoning.

Sample heterogeneity in ML-detected reasoning

In many ML-related research projects focusing on mechanistic reasoning, the homogeneity of the sample population is mentioned as a limitation (cf., Table 2). In thirteen studies, data was collected at a single institution (level 1), while six others collected data at different institutions (level 2). However, only Noyes et al. (2020) gathered data from three institutions to investigate differential effects (level 3), that is, how different institutional backgrounds, curricula, and student ethnicity influence the accuracy of ML models. They built an initial model fed with responses from a single student population and a combined model fed with responses from four diverse student populations. Cross-validation showed that the combined model performed comparably to the initial model. However, split-validation revealed that the combined model performed slightly better than the initial model in predicting the level of causal mechanistic reasoning for responses from different institutions. Consequently, the combined model could capture mechanistic reasoning about London dispersion forces more equitably in other contexts, documenting the need for collecting responses from students with diverse backgrounds.

Overall, the reviewed studies highlight that there is a risk of producing biased ML models. When the data sample is too homogeneous or when bias is introduced in the way prompts are structured, rubrics are designed, and responses are coded, ML models will replicate this bias. The development of automated scoring models goes, according to Cheuk (2021), hand in hand with questions of equity. Future studies should consider how bias is introduced into their ML model, how the parameters of an algorithm must be set to reduce bias, and which differential effects occur.

Implications and future work

Based on the ML-adapted ECD, we reviewed the selected studies in each of the six categories. However, due to the rapid progress of ML, the question arises of what needs to be done to further advance the application of ML in science assessment. We expanded the ML-adapted ECD by describing a fourth level that points out future implications (cf., Fig. 4).
Fig. 4 Next level for implementing evidence-centered design in machine learning-based science assessment.

Pedagogical purpose of ML application

The pedagogical benefits of ML's application have not yet been fully explored. For instance, in a learning environment using Just-in-Time Teaching (Novak et al., 1999) guided by ML, an instructor can longitudinally assign online homework questions to students. After working on their homework, students can receive automated feedback and tailored exercises. This concept can also be realized with an Application Programming Interface, which can automatically assign tailored learning assistance in terms of follow-up tutorials over a semester to each student. Including students’ experiences, interests, and interdisciplinary and extracurricular skills in the evaluation may be possible as well. This technology-based approach to adaptive learning helps students self-assess their improvements and deficits, while instructors can monitor students’ proficiency just before the next class session starts. With an overview of students’ competencies, teachers can then tailor their lessons instantly.

Beyond research focusing on delivering automated guidance directly to students, there is a need for research on teacher dashboards. Showing ML output to provide instructors with insights into the ways students respond can be useful, as it helps streamline the evaluation process. Furthermore, dashboards allow instructors to tailor direct guidance and provide individualized feedback to students (Urban-Lurain et al., 2013).

Rubric design

To facilitate the rubric design for ML-based formative assessment, Raker et al. (2023) aggregated previously applied frameworks for evaluating mechanistic reasoning by defining four hierarchical levels of explanation sophistication: absent, descriptive, foundational, and complex. They argue that their framework makes it possible to predict an overall level of sophistication as well as to evaluate different reaction aspects, e.g., electrophiles, nucleophiles, carbocation stability, leaving group stability, proton transfer, and solvent effects. To show the applicability of their framework, Raker et al. (2023) derived a rubric for evaluating explanation sophistication for electrophiles, independent of the reaction context. The operationalization of the proposed framework for rubric design in ML-based assessment may allow for a fine-grained evaluation of student responses.

However, most rubrics have been built deductively, which means that researchers first set a framework and afterward analyze data according to this framework. Unsupervised pattern recognition holds, though, great potential for designing rubrics inductively. By applying a three-step process called computational grounded theory (Nelson, 2020), unsupervised ML can be integrated with domain-specific human expertise to raise the reliability and validity of rubrics: in a first step, unsupervised algorithms can detect patterns in data exploratively to gain breadth and quantity. After that, humans can interpret these emerging patterns to add depth and quality. In the end, algorithms can check whether the qualitative human coding represents the structure of the data and whether it can be applied reliably. Unsupervised ML and humans, thus, form a symbiosis in data interpretation.

For example, Rosenberg and Krist (2021) applied computational grounded theory to develop a nuanced rubric for evaluating the epistemic characteristics of the generality of model-based explanations. They combined a two-step clustering approach with qualitative, interpretative coding to derive a fine-grained rubric inductively from students’ responses. The combination of unsupervised ML and qualitative coding finally revealed the complexity of students' conceptions. In the future, this approach may help develop computerized instructional systems that cover the whole range of student sophistication, e.g., in terms of mechanistic reasoning.
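The third step of this process, checking whether interpretative human codes mirror the structure an algorithm finds in the data, could, for instance, be quantified as sketched below; the codes, cluster assignments, and the choice of the adjusted Rand index as agreement measure are illustrative assumptions rather than a prescribed part of computational grounded theory.

```python
# Minimal sketch: checking whether qualitative human codes mirror the structure
# an unsupervised algorithm finds in the data (hypothetical codes and clusters)
from sklearn.metrics import adjusted_rand_score

human_codes   = [0, 0, 1, 1, 2, 2, 2, 1]  # codes from interpretative coding
cluster_codes = [0, 0, 1, 1, 2, 2, 1, 1]  # clusters from unsupervised ML

# The adjusted Rand index compares the two partitions independent of label names
print(adjusted_rand_score(human_codes, cluster_codes))
```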

Construct assessment

In many studies, constructs have been assessed by grouping responses into multiple categories. When running human- or computer-set if-then commands, this category-based approach comes with limitations. Categories formed in this way may not capture the heterogeneity of students' ideas since they only identify the presence of chemical concepts but not their sophistication. To mitigate this constraint, categories need to be generated to which only responses with a high level of sophistication are assigned (e.g., Haudek et al., 2012; Dood et al., 2018). Furthermore, the covariance of different categories should be examined using web diagrams (e.g., Haudek et al., 2012; Prevost et al., 2012b). Web diagrams not only visualize the number of responses in each category but also represent connections between categories. However, one has to keep in mind that a wide range of responses may be assigned to a single category. So, the examination of conceptual categories may only reveal that specific concepts are integrated into a response, but not to what extent the concepts are elaborated, how plausible their connections are, and how complex an explanation is.
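For illustration, the minimal sketch below computes the kind of co-occurrence information a web diagram visualizes from hypothetical binary category assignments; the category names are borrowed from the examples above, but the data and computation are our own illustrative assumptions.

```python
# Minimal sketch: counting how often conceptual categories co-occur in the same
# response, the kind of information a web diagram visualizes (data hypothetical)
import numpy as np

categories = ["strong base", "raise pH", "ionization"]
# Each row is one response; 1 = category detected in that response
assignments = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
])

# Diagonal: responses per category; off-diagonal: co-occurrence counts
co_occurrence = assignments.T @ assignments
print(co_occurrence)
```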

Moreover, n-grams and term frequency weighting only allow for identifying and counting sequences of n adjacent words, so that a response is merely represented by a vector of counts, called a bag of words. Given that the meaning of a word significantly depends on its context, semantic, logical, and technical relationships between phrases may only be analyzed superficially with this technique. n-Grams and related approaches cannot capture contextual embeddings, cross-references, and implicit statements. In the future, cutting-edge technologies like BERT may better reveal contextual embeddings in written responses.
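The following minimal sketch shows how an n-gram bag-of-words representation reduces responses to count vectors; the two hypothetical responses convey nearly the same idea yet receive different vectors because word order beyond adjacent words is discarded (the tooling and examples are our assumptions).

```python
# Minimal sketch: representing responses as a bag of uni- and bigrams, which
# discards word order beyond n adjacent words (hypothetical responses)
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "the nucleophile attacks the electrophilic carbon",
    "the carbon is attacked by the nucleophile",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(responses)

print(vectorizer.get_feature_names_out())  # the extracted n-grams
print(X.toarray())                         # each response as a count vector
```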

Furthermore, the definition of constructs often relies on written features when capturing mechanistic reasoning, resulting in constructs that are solely based on students’ learning products (Mislevy, 2016; Zhai et al., 2020a). However, ML algorithms can include other types of data in their predictive model, e.g., log data (e.g., Zhu et al., 2017; Lee et al., 2021). Gathering process data with high-tech devices or sensory information may also facilitate the definition of new constructs. For example, Ghali et al. (2016) developed the educational game LewiSpace to teach drawing Lewis structures. They collected data about emotional and cognitive states when predicting whether a learner will succeed or fail at a level of the game. Primarily, they used three types of sensors, namely electroencephalography, eye tracking, and facial expression recognition with an optical camera, to predict learners’ success rates. Integrating knowledge from neuroscience, psychology, and computer science, such as behavioral, psychological, and brain wave data (e.g., Wehbe et al., 2014; Mason and Just, 2016), may finally help include states of emotion, cognition, and affect in science assessment (Kubsch et al., 2022a).

Validation approaches

In most cases, algorithms have been trained to detect categories or n-grams, but this approach is increasingly outdated. Recent applications of natural language processing use BERT, either for evaluating a writing-to-learn activity (Winograd et al., 2021b) or peer review (Winograd et al., 2021a; Dood et al., 2022). BERT is pre-trained on billions of sentences, which means that less input data but more computational resources are needed. Notably, BERT helps detect the structure of arguments in longer texts, which makes it useful for analyzing essays with domain-general content. In science assessment, however, ML is mostly applied for the evaluation of short, content-rich explanations. So, future research needs to examine whether such cutting-edge techniques can also accurately evaluate short, content-rich scientific explanations. Such technologies may produce better outcome metrics than the n-gram approach; Dood et al. (2022) and Winograd et al. (2021a, 2021b) in chemistry education research as well as Wulff et al. (2022a, 2022b) in physics education research have laid the foundation for future research in this area.

Prompt structure

The construction of sound, logical, and coherent explanations is needed to reason appropriately about underlying processes, which requires well-designed prompts. Many prompts that elicit not only what happens but also why reactions occur have already been developed in ML-based formative assessments (Dood et al., 2018, 2020a; Noyes et al., 2020; Yik et al., 2021; Watts et al., 2023). These scaffolded prompts serve as an instructional guide when organizing explanations about mechanistic processes; however, any scaffold must be understood as a temporary support. Once a student has internalized the intended explanatory pattern, the instructional support should be faded (McNeill et al., 2006; Noroozi et al., 2018). To advance ML's application in the category prompt structure, automated text analysis could be used to immediately evaluate which prompt type is most suitable for a student. Offering adapted and faded prompts afterward would provide students with instructional support matched to their current level of sophistication.

Sample heterogeneity

If the data used to train an ML algorithm is not representative of a diverse student population, the algorithm may be biased toward certain groups or individuals. Building upon the work of Noyes et al. (2020), it might, thus, be necessary to further investigate the differential effects of an algorithm, i.e., how gender, age, cultural and linguistic background, socioeconomic status, educational level and resources, learning environment and challenges, prior knowledge, individual experiences, or motivation impact the conclusions drawn from automated analysis. With the help of such investigations, several steps could be taken to minimize the observed differential effects. First, gathering a heterogeneous dataset for the training of an algorithm helps ensure that the model is not biased toward a particular group of students. Subsequently, counterfactual analyses can provide insight into how particular algorithmic decisions are made, which helps identify and mitigate any biases. Informed by the findings of counterfactual analyses, bias-sensitive algorithms can be developed and validated using metrics accounting for fairness. However, despite all technological opportunities, domain experts remain irreplaceable in revising and adjusting the algorithm.
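As a starting point for such investigations, machine-human agreement can be compared across student subgroups, as in the minimal sketch below; the codes, group labels, and the use of Cohen's kappa are hypothetical illustrations, not an analysis reported in the reviewed studies.

```python
# Minimal sketch: comparing machine-human agreement across student subgroups to
# probe for differential effects (all data hypothetical)
from sklearn.metrics import cohen_kappa_score

# Human codes, machine codes, and an institution label per response
human   = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
machine = [2, 1, 0, 1, 1, 1, 0, 2, 1, 1]
group   = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

# A notable gap in agreement between institutions would flag a differential effect
for g in sorted(set(group)):
    idx = [i for i, x in enumerate(group) if x == g]
    kappa = cohen_kappa_score([human[i] for i in idx], [machine[i] for i in idx])
    print(f"Institution {g}: Cohen's kappa = {kappa:.2f}")
```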

Conclusions

The primary goal of this review was to illustrate the advancements and shortcomings that ML has brought to the formative assessment of mechanistic reasoning by providing thorough detail of the contemporary effective practices in this area. To answer our first guiding question, various contemporary effective practices have been identified as indicated by the level 3 classification in Table 2. However, the small number of articles shows that much research may still be needed to further advance ML's application in chemistry assessment, especially to capture complex reasoning.

Synthesizing the reviewed studies to answer the other two guiding questions revealed that applying ML advanced the analysis of mechanistic reasoning in some categories of the ML-adapted ECD. The tenets of the evidence space are comprehensively addressed by current research studies. Due to the design of multi-level rubrics capturing different sophistication levels, evidence rules that determine how to analyze features in written responses are characterized in detail. Additionally, the technological progress of recent years has facilitated the construction of measurement models that assess conceptual and semantic features, as shown by Winograd et al. (2021b) and Watts et al. (2023). Besides, validation approaches are extensively reported in the reviewed studies. Accordingly, future research should use fine-grained rubrics in conjunction with cutting-edge technologies to continue these contemporary effective practices. Regarding the task space, the analysis is more differentiated, as both advancements and shortcomings can be identified. Developing scaffolded prompts helped elicit evidence of mechanistic reasoning, as shown by many studies. However, only Noyes et al. (2020) investigated differential effects in their sample, which leads us to the conclusion that low sample heterogeneity is a major limitation of many studies. Further shortcomings can be identified in the claim space since, in more than half of the selected studies, the pedagogical purpose of ML application currently does not go beyond automating the scoring of students’ mechanistic reasoning. That is, many studies did not envisage beforehand how to embed ML analysis optimally in a learning environment. To this end, research is needed that investigates students’ and instructors’ expectations of ML output in automated science assessment.

The identified advancements and shortcomings of our analysis are in alignment with other reviews and perspectives (Zhai et al., 2020a, 2020c; Kubsch et al., 2023). In their dimension construct, Zhai et al. (2020a) found that 81% of the reviewed studies tapped complex constructs, indicating the great potential of ML in assessing such constructs. Since mechanistic reasoning is a complex construct per se (cf., Bachtiar et al., 2022), our review confirms Zhai et al.'s (2020a) finding. However, in many studies of our sample, there is no evidence that the investigated construct could only be analyzed by ML. Instead, the primary intention of most reviewed studies was to replicate human scoring and deliver conventional science assessments faster. Hence, ML's potential for deepening human evaluation, analyzing more nuanced cognitive processes, and dealing with big data is not yet fully utilized (Zhai et al., 2020c).

Regarding the automaticity of an assessment, we found, just like Zhai et al. (2020a), that most studies advanced the application of straightforward supervised techniques to automate, replicate, and scale human scoring of students’ responses from valid and reliable assessment instruments, while only a few employed unsupervised techniques to reveal hidden patterns. This finding indicates that the automaticity of scoring complex constructs with supervised techniques is already well-researched. However, as shown by the studies we reviewed, many formative assessments are only automated for a rather homogeneous sample because most studies collected data at a single institution.

Furthermore, many studies in our sample provide limited improvement to assessment functionality. Although technological advances have made it increasingly possible to capture conceptual as well as semantic features in text responses, the pedagogical purpose of applying ML is usually not envisioned a priori. This seems to be a common issue in current ML studies, as Zhai et al. (2020a, 2020c) also found that not even half of the reviewed studies reported how to embed their assessment in a learning environment. In the future, more studies are needed to investigate the opportunities of ML for individually and adaptively supporting students’ reasoning. Integrating high-quality learning assistance into highly functional ML-based instructional settings ultimately allows for adaptively supporting students in acquiring scientific competence.

In sum, ML techniques hold great potential for supporting humans in data analysis. Moreover, additional data sources can be combined to extend the validity of conclusions (Kubsch et al., 2021). To date, research has mostly focused on assessment automation, which only changes the quantity of research efforts, but not their quality (Nelson, 2020; Kubsch et al., 2023). Hence, we obtain more insights, but the same kind of insights as in traditional assessments. Focusing on automation leaves many ML methods unexplored, which is why the quality of evidence drawn from data does not change. In the future, it is, for instance, necessary to integrate computational techniques with interpretative human coding to mutually raise the validity of both and gain new insights into cognitive processes (Sherin, 2013; Nelson, 2020; Nelson et al., 2021; Kubsch et al., 2023).

Along with other researchers (Zhai et al., 2020a, 2020c; Kubsch et al., 2023), we suggest exploring the transformative character of ML in science education beyond automating formative assessment. This includes moving away from perceiving humans as mere consumers of algorithmic output that removes certain decision-making processes, since such output rarely answers research questions related to the qualitative analysis of cognitive operations (Kubsch et al., 2023). In other words, ML algorithms do not replace human qualitative interpretation; thoughtful human interpretation is more crucial than ever.

Author contributions

Paul P. Martin: conceptualization, investigation, methodology, formal analysis, visualization, writing – original draft. Nicole Graulich: supervision, conceptualization, investigation, methodology, visualization, writing – review and editing.

Conflicts of interest

There are no conflicts to declare.

Appendix: extended glossary for ML-related terms

Definition of ML

ML is a field that intersects computer science, data science, statistics, and artificial intelligence (AI), and assumes that algorithms can learn from previous experience without explicit programming (Samuel, 1959). Therefore, ML can be defined as “a computer program [that] is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” (Mitchell, 1997, p. 2). The key point of this definition is that ML “is concerned with the question of how to construct computer programs that automatically improve with experience” (Mitchell, 1997, p. XV). In this context, the term experience refers to some kind of information, such as labels in data.

The procedure of using ML

Applying ML mainly comprises two stages: learning from existing experience (training phase) and predicting new inputs (testing phase). In the training phase, a set of features is initially extracted from a dataset using pre-processing methods. Then, an algorithmic model is generated based on the extracted features of the existing data. Finally, this algorithmic model can be deployed to analyze the testing data.
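A minimal sketch of these two stages, assuming scikit-learn, hypothetical responses, and a simple logistic regression classifier, could look as follows.

```python
# Minimal sketch of the two stages: learning from labeled responses (training)
# and predicting codes for new responses (testing); data are hypothetical
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_responses = [
    "the base accepts a proton from the acid",
    "electrons are shared between the two atoms",
    "the proton is transferred to the lone pair",
    "a covalent bond forms by sharing electrons",
]
train_labels = ["proton transfer", "electron sharing",
                "proton transfer", "electron sharing"]

# Training phase: extract features and fit an algorithmic model
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_responses, train_labels)

# Testing phase: deploy the model on unseen responses
print(model.predict(["the acid donates a proton to the base"]))
```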

Supervised ML

Supervised ML has the goal of mimicking the human ability to infer labels from features in datasets. To do so, the algorithm must be fed with already labeled data. By deciphering the underlying patterns, the algorithm replicates the known input-output relationship. Finally, labels in unknown data can be predicted.

Unsupervised ML

Unsupervised ML techniques operate on unlabeled data, which means these techniques are deployed to reveal hidden patterns. These hidden patterns can afterward be utilized to establish new labels. Since there is no need for human coding at all, unsupervised approaches save time, resources, costs, and effort.

Semi-supervised ML

A hybrid form of supervised and unsupervised ML is semi-supervised ML, which is deployed for data consisting of labeled and unlabeled cases. After building a model using the labeled data, the semi-supervised algorithm is utilized to predict labels for the unlabeled data. The model can then be adjusted and re-trained with larger datasets and less coding effort.
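A minimal sketch of this idea, assuming scikit-learn's SelfTrainingClassifier and a handful of hypothetical responses in which unlabeled cases are marked with -1, could look as follows.

```python
# Minimal sketch of semi-supervised learning with scikit-learn's
# SelfTrainingClassifier; unlabeled responses are marked with -1 (data hypothetical)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

responses = [
    "the base accepts a proton",           # labeled: 1
    "electrons are shared in a bond",      # labeled: 0
    "a proton is transferred to the base", # unlabeled
]
labels = np.array([1, 0, -1])  # -1 marks unlabeled cases

X = TfidfVectorizer().fit_transform(responses)

# The base classifier is first fit on labeled data, then iteratively
# pseudo-labels and re-trains on confidently predicted unlabeled cases
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.5)
clf.fit(X, labels)
print(clf.predict(X))
```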

Self-validation

Self-validation uses the same data for constructing and evaluating the ML model. Since the quality of the model is not measured on new data, the approach of split-validation is often preferred.

Split-validation

When applying split-validation, the dataset must be randomly divided into a training and a testing set to analyze the accuracy of predictions even for unknown inputs.
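A minimal sketch of split-validation with scikit-learn, using synthetic data in place of coded student responses, could look as follows.

```python
# Minimal sketch of split-validation: a random train/test split and the
# machine-human agreement on the held-out set (synthetic data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# Withhold 25% of the responses for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```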

Cross-validation

Cross-validation extends the idea of split-validation. For k-fold cross-validation, the data is first partitioned into k randomly selected subsets with the same number of responses. Whereas (k − 1) folds are then used for model training, the remaining fold is withheld for model validation. The codes of the remaining fold are predicted by the algorithm, which is built on the other (k − 1) folds. This procedure is repeated until each of the k validation sets has been coded by a model built on the respective other (k − 1) folds. Finally, an average agreement between human and computer codes can be calculated.
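Analogously, k-fold cross-validation can be sketched with scikit-learn as follows; the synthetic data again stand in for coded student responses.

```python
# Minimal sketch of k-fold cross-validation: the model is trained on k-1 folds
# and validated on the remaining fold, repeated k times (synthetic data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

k = 5
scores = cross_val_score(
    LogisticRegression(), X, y,
    cv=KFold(n_splits=k, shuffle=True, random_state=0)
)
print(scores)         # agreement on each of the k held-out folds
print(scores.mean())  # average agreement across folds
```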

Reference to AI

The application of ML and AI is increasing not only in the field of science education but also in medicine, mobility, media, marketing, fraud detection, and face recognition; however, the two terms are rarely distinguished (Kühl et al., 2020). This can be explained by the various relationships between the two technologies. ML algorithms benefit from training to solve real-world problems automatically through computational resources (Mohri et al., 2012). ML thus represents a set of methods for learning from patterns in data (Kühl et al., 2020). AI is, in turn, an application of several techniques, among others ML, which aims to perform intelligent and cognitive tasks while imitating human behavior. In other words, AI refers to machines that simulate human reasoning steps (Newell and Simon, 1961; Bellmann, 1978; Haugeland, 1989), leading to the criterion that AI must accomplish human tasks at least as well as humans do (Rich et al., 2009).

Acknowledgements

This publication is part of the first author's doctoral thesis (Dr. rer. nat.) at the Faculty of Biology and Chemistry, Justus-Liebig-University Giessen, Germany. We thank Marcus Kubsch for a thoughtful discussion of the manuscript. Furthermore, we thank all members of the Graulich group for fruitful discussions. Finally, Paul P. Martin thanks Amber Dood, Field Watts, and Brandon Yik for the helpful meetings on machine learning in chemistry education research.

References

  1. Allen D. and Tanner K., (2006), Rubrics: Tools for Making Learning Goals and Evaluation Criteria Explicit for Both Teachers and Learners, CBE Life Sci. Educ., 5, 197–203.
  2. Bachtiar R. W., Meulenbroeks R. F. G. and van Joolingen W. R., (2022), Mechanistic reasoning in science education: a literature review, EURASIA J. Math. Sci. Tech. Ed., 18, em2178.
  3. Beggrow E. P., Ha M., Nehm R. H., Pearl D. and Boone W. J., (2014), Assessing Scientific Practices Using Machine-Learning Methods: How Closely Do They Match Clinical Interview Performance? J. Sci. Educ. Technol., 23, 160–182.
  4. Bellmann R., (1978), An Introduction to Artificial Intelligence: Can Computers Think? Boyd and Fraser.
  5. Birenbaum M. and Tatsuoka K. K., (1987), Open-Ended Versus Multiple-Choice Response Formats – It Does Make a Difference for Diagnostic Purposes, Appl. Psychol. Meas., 11, 385–395.
  6. Bishop C. M., (2006), Pattern Recognition and Machine Learning, New York: Springer.
  7. Bolger M. S., Kobiela M., Weinberg P. J. and Lehrer R., (2012), Children's Mechanistic Reasoning, Cogn. Instr., 30, 170–206.
  8. Carey S., (1995), Causal cognition: A multidisciplinary debate, New York, NY, US: Clarendon Press/Oxford University Press, pp. 268–308.
  9. Caspari I. and Graulich N., (2019), Scaffolding the structure of organic chemistry students’ multivariate comparative mechanistic reasoning, Int. J. Phys. Chem. Ed., 11, 31–43.
  10. Caspari I., Kranz D. and Graulich N., (2018a), Resolving the complexity of organic chemistry students' reasoning through the lens of a mechanistic framework, Chem. Educ. Res. Pract., 19, 1117–1141.
  11. Caspari I., Weinrich M., Sevian H. and Graulich N., (2018b), This mechanistic step is “productive”: organic chemistry students’ backward-oriented reasoning, Chem. Educ. Res. Pract., 19, 42–59.
  12. Cheuk T., (2021), Can AI be racist? Color-evasiveness in the application of machine learning to science assessments, Sci. Educ., 105, 825–836.
  13. Cooper M. M., (2015), Why Ask Why? J. Chem. Educ., 92, 1273–1279.
  14. Cooper M. M., Kouyoumdjian H. and Underwood S. M., (2016), Investigating Students’ Reasoning about Acid–Base Reactions, J. Chem. Educ., 93, 1703–1712.
  15. Deeva G., Bogdanova D., Serral E., Snoeck M. and De Weerdt J., (2021), A review of automated feedback systems for learners: Classification framework, challenges and opportunities, Comput. Educ., 162, 104094.
  16. DeGlopper K. S., Schwarz C. E., Ellias N. J. and Stowe R. L., (2022), Impact of Assessment Emphasis on Organic Chemistry Students’ Explanations for an Alkene Addition Reaction, J. Chem. Educ., 99, 1368–1382.
  17. Deng J. M., Rahmani M. and Flynn A. B., (2022), The role of language in students’ justifications of chemical phenomena, Int. J. Sci. Educ., 44, 2131–2151.
  18. diSessa A. A., (1993), Toward an Epistemology of Physics, Cogn. Instr., 10, 105–225.
  19. Donnelly D. F., Vitale J. M. and Linn M. C., (2015), Automated Guidance for Thermodynamics Essays: Critiquing Versus Revisiting, J. Sci. Educ. Technol., 24, 861–874.
  20. Dood A. J. and Watts F. M., (2022a), Mechanistic Reasoning in Organic Chemistry: A Scoping Review of How Students Describe and Explain Mechanisms in the Chemistry Education Research Literature, J. Chem. Educ., 99, 2864–2876.
  21. Dood A. J. and Watts F. M., (2022b), Students’ Strategies, Struggles, and Successes with Mechanism Problem Solving in Organic Chemistry: A Scoping Review of the Research Literature, J. Chem. Educ., 100, 53–68.
  22. Dood A. J., Fields K. B. and Raker J. R., (2018), Using Lexical Analysis to Predict Lewis Acid-Base Model Use in Response to an Acid-Base Proton-Transfer Reaction, J. Chem. Educ., 95, 1267–1275.
  23. Dood A. J., Fields K. B., Cruz-Ramírez de Arellano D. and Raker J. R., (2019), Development and evaluation of a Lewis acid-base tutorial for use in postsecondary organic chemistry courses, Can. J. Chem., 97, 711–721.
  24. Dood A. J., Dood J. C., Cruz-Ramírez de Arellano D., Fields K. B. and Raker J. R., (2020a), Analyzing explanations of substitution reactions using lexical analysis and logistic regression techniques, Chem. Educ. Res. Pract., 21, 267–286.
  25. Dood A. J., Dood J. C., Cruz-Ramírez de Arellano D., Fields K. B. and Raker J. R., (2020b), Using the Research Literature to Develop an Adaptive Intervention to Improve Student Explanations of an SN1 Reaction Mechanism, J. Chem. Educ., 97, 3551–3562.
  26. Dood A. J., Winograd B. A., Finkenstaedt-Quinn S. A., Gere A. R. and Shultz G. V., (2022), PeerBERT: Automated Characterization of Peer Review Comments Across Courses, in Proceedings of the LAK22: 12th International Learning Analytics and Knowledge Conference, New York, NY, pp. 492–499.
  27. Gerard L. F., Matuk C., McElhaney K. and Linn M. C., (2015), Automated, adaptive guidance for K-12 education, Educ. Res. Rev., 15, 41–58.
  28. Gerard L. F., McElhaney K. W., Rafferty A. N., Ryoo K., Liu O. L. and Linn M. C., (2016), Automated Guidance for Student Inquiry, J. Educ. Psychol., 108, 60–81.
  29. Ghali R., Ouellet S. and Frasson C., (2016), LewiSpace: an Exploratory Study with a Machine Learning Model in an Educational Game, J. Educ. Train. Stud., 4, 192–201.
  30. Glaser R., Lesgold A. and Lajoie S., (1987), Toward a Cognitive Theory for the Measurement of Achievement, in Ronning R. R., Glover J. A., Conoley J. C. and Witt J. C. (ed.), The Influence of Cognitive Psychology on Testing and Measurement, Lawrence Erlbaum, pp. 41–85.
  31. Glennan S., (2002), Rethinking Mechanistic Explanation, Philos. Sci., 69, S342–S353.
  32. Gobert J. D., Sao Pedro M., Raziuddin J. and Baker R. S., (2013), From Log Files to Assessment Metrics: Measuring Students' Science Inquiry Skills Using Educational Data Mining, J. Learn. Sci., 22, 521–563.
  33. Gobert J. D., Baker R. and Wixon M. B., (2015), Operationalizing and Detecting Disengagement Within Online Science Microworlds, Educ. Psychol., 50, 43–57.
  34. Graulich N., (2015), The tip of the iceberg in organic chemistry classes: how do students deal with the invisible? Chem. Educ. Res. Pract., 16, 9–21.
  35. Graulich N. and Caspari I., (2020), Designing a scaffold for mechanistic reasoning in organic chemistry, Chem. Teach. Int., 3, 19–30.
  36. Graulich N. and Schween M., (2018), Concept-Oriented Task Design: Making Purposeful Case Comparisons in Organic Chemistry, J. Chem. Educ., 95, 376–383.
  37. Grimberg B. I. and Hand B., (2009), Cognitive Pathways: Analysis of students' written texts for science understanding, Int. J. Sci. Educ., 31, 503–521.
  38. Grove N. P. and Lowery Bretz S., (2012), A continuum of learning: from rote memorization to meaningful learning in organic chemistry, Chem. Educ. Res. Pract., 13, 201–208.
  39. Ha M. and Nehm R., (2016), The Impact of Misspelled Words on Automated Computer Scoring: A Case Study of Scientific Explanations, J. Sci. Educ. Technol., 25, 358–374.
  40. Ha M., Nehm R. H., Urban-Lurain M. and Merrill J. E., (2011), Applying Computerized-Scoring Models of Written Biological Explanations across Courses and Colleges: Prospects and Limitations, CBE Life Sci. Educ., 10, 379–393.
  41. Hammer D., (2000), Student resources for learning introductory physics, Am. J. Phys., 68, S52–S59.
  42. Haudek K. C. and Zhai X., (2021), Exploring the Effect of Assessment Construct Complexity on Machine Learning Scoring of Argumentation, Presented in part at the National Association of Research in Science Teaching Annual Conference, Virtual.
  43. Haudek K. C., Moscarella R. A., Urban-Lurain M., Merrill J. E., Sweeder R. D. and Richmond G., (2009), Using lexical analysis software to understand student knowledge transfer between chemistry and biology, Presented in part at the National Association of Research in Science Teaching Annual Conference, Garden Grove, CA.
  44. Haudek K. C., Kaplan J. J., Knight J., Long T. M., Merrill J. E., Munn A., Nehm R. H., Smith M. and Urban-Lurain M., (2011), Harnessing Technology to Improve Formative Assessment of Student Conceptions in STEM: Forging a National Network, CBE Life Sci. Educ., 10, 149–155.
  45. Haudek K. C., Prevost L. B., Moscarella R. A., Merrill J. E. and Urban-Lurain M., (2012), What Are They Thinking? Automated Analysis of Student Writing about Acid-Base Chemistry in Introductory Biology, CBE Life Sci. Educ., 11, 283–293.
  46. Haudek K. C., Moscarella R. A., Weston M., Merrill J. E. and Urban-Lurain M., (2015), Construction of rubrics to evaluate content in students' scientific explanation using computerized text analysis, Presented in part at the National Association of Research in Science Teaching Annual Conference, Chicago, IL.
  47. Haudek K. C., Wilson C. D., Stuhlsatz M. A. M., Donovan B., Bracey Z. B., Gardner A., Osborne J. F. and Cheuk T., (2019), Using automated analysis to assess middle school students' competence with scientific argumentation, Presented in part at the National Conference on Measurement in Education (NCME), Annual Conference, Toronto, ON.
  48. Haugeland J., (1989), Artificial Intelligence: The Very Idea, MIT Press.
  49. Illari P. M. and Williamson J., (2012), What is a mechanism? Thinking about mechanisms across the sciences, Eur. J. Philos. Sci., 2, 119–135.
  50. Jescovitch L. N., Doherty J. H., Scott E. E., Cerchiara J. A., Wenderoth M. P., Urban-Lurain M., Merrill J. E. and Haudek K. C., (2019a), Challenges in Developing Computerized Scoring Models for Principle-Based Reasoning in a Physiology Context, Presented in part at the National Association of Research in Science Teaching Annual Conference, Baltimore, MD.
  51. Jescovitch L. N., Scott E. E., Cerchiara J. A., Doherty J. H., Wenderoth M. P., Merrill J. E., Urban-Lurain M. and Haudek K. C., (2019b), Deconstruction of Holistic Rubrics into Analytic Bins for Large-Scale Assessments of Students' Reasoning of Complex Science Concepts, Pract. Assess. Res. Eval., 24, 1–13,  DOI:10.7275/9h7f-mp76.
  52. Jescovitch L. N., Scott E. E., Cerchiara J. A., Merrill J. E., Urban-Lurain M., Doherty J. H. and Haudek K. C., (2021), Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression, J. Sci. Educ. Technol., 30, 150–167.
  53. Kang H., Thompson J. and Windschitl M., (2014), Creating Opportunities for Students to Show What They Know: The Role of Scaffolding in Assessment Tasks, Sci. Educ., 98, 674–704.
  54. Kerr P., (2016), Adaptive learning, ELT J., 70, 88–93.
  55. Kraft A., Strickland A. M. and Bhattacharyya G., (2010), Reasonable reasoning: multi-variate problem-solving in organic chemistry, Chem. Educ. Res. Pract., 11, 281–292.
  56. Kranz D., Schween M. and Graulich N., (2023), Patterns of reasoning – exploring the interplay of students’ work with a scaffold and their conceptual knowledge in organic chemistry, Chem. Educ. Res. Pract.,  10.1039/d2rp00132b.
  57. Krist C., Schwarz C. V. and Reiser B. J., (2019), Identifying Essential Epistemic Heuristics for Guiding Mechanistic Reasoning in Science Learning, J. Learn. Sci., 28, 160–205.
  58. Kubsch M., Rosenberg J. M. and Krist C., (2021), Beyond Supervision: Human/Machine Distributed Learning in Learning Sciences Research, in Proceedings of the 15th International Conference of the Learning Sciences-ICLS 2021, Bochum, Germany, pp. 897–898.
  59. Kubsch M., Caballero D. and Uribe P., (2022a), Once More with Feeling: Emotions in Multimodal Learning Analytics, in Giannakos M., Spikol D., Di Mitri D., Sharma K., Ochoa X. and Hammad R. (ed.), The Multimodal Learning Analytics Handbook, Cham: Springer International Publishing, pp. 261–285.
  60. Kubsch M., Czinczel B., Lossjew J., Wyrwich T., Bednorz D., Bernholt S., Fiedler D., Strauß S., Cress U., Drachsler H., Neumann K. and Rummel N., (2022b), Toward learning progression analytics—Developing learning environments for the automated analysis of learning using evidence centered design, Front. Educ., 7, 1–15,  DOI:10.3389/feduc.2022.981910.
  61. Kubsch M., Krist C. and Rosenberg J. M., (2023), Distributing epistemic functions and tasks – A framework for augmenting human analytic power with machine learning in science education research, J. Res. Sci. Teach., 60, 423–447.
  62. Kuechler L. W. and Simkin M. G., (2010), Why Is Performance on Multiple-Choice Tests and Constructed-Response Tests Not More Closely Related? Theory and an Empirical Test, Dec. Sci. J. Innov. Educ., 8, 55–73.
  63. Kühl N., Goutier M., Hirt R. and Satzger G., (2020), Machine Learning in Artificial Intelligence: Towards a Common Understanding, arXiv, preprint, arXiv:2004.04686,  DOI:10.48550/arXiv.2004.04686.
  64. Lee H.-S., Liu O. L. and Linn M. C., (2011), Validating Measurement of Knowledge Integration in Science Using Multiple-Choice and Explanation Items, Appl. Meas. Educ., 24, 115–136.
  65. Lee H.-S., Gweon G.-H., Lord T., Paessel N., Pallant A. and Pryputniewicz S., (2021), Machine Learning-Enabled Automated Feedback: Supporting Students' Revision of Scientific Arguments Based on Data Drawn from Simulation, J. Sci. Educ. Technol., 30, 168–192.
  66. Lieber L. S. and Graulich N., (2020), Thinking in Alternatives—A Task Design for Challenging Students’ Problem-Solving Approaches in Organic Chemistry, J. Chem. Educ., 97, 3731–3738.
  67. Lieber L. S. and Graulich N., (2022), Investigating students' argumentation when judging the plausibility of alternative reaction pathways in organic chemistry, Chem. Educ. Res. Pract., 23, 38–53.
  68. Lieber L. S., Ibraj K., Caspari-Gnann I. and Graulich N., (2022a), Closing the gap of organic chemistry students’ performance with an adaptive scaffold for argumentation patterns, Chem. Educ. Res. Pract., 23, 811–828.
  69. Lieber L. S., Ibraj K., Caspari-Gnann I. and Graulich N., (2022b), Students’ Individual Needs Matter: A Training to Adaptively Address Students’ Argumentation Skills in Organic Chemistry, J. Chem. Educ., 99, 2754–2761.
  70. Linn M. C. and Eylon B.-S., (2011), Science learning and instruction: Taking advantage of technology to promote knowledge integration, New York, NY: Routledge.
  71. Linn M. C., Gerard L. F., Ryoo K., McElhaney K., Liu O. L. and Rafferty A. N., (2014), Education technology. Computer-guided inquiry to improve science learning, Science, 344, 155–156.
  72. Liu O. L., Brew C., Blackmore J., Gerard L., Madhok J. and Linn M. C., (2014), Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles, Educ. Meas, 33, 19–28.
  73. Liu O. L., Rios J. A., Heilman M., Gerard L. and Linn M. C., (2016), Validation of Automated Scoring of Science Assessments, J. Res. Sci. Teach., 53, 215–233.
  74. Machamer P., Darden L. and Craver C. F., (2000), Thinking About Mechanisms, Philos. Sci., 67, 1–25.
  75. Maestrales S., Zhai X., Touitou I., Baker Q., Schneider B. and Krajcik J., (2021), Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics, J. Sci. Educ. Technol., 30, 239–254.
  76. Mao L., Liu O. L., Roohr K., Belur V., Mulholland M., Lee H.-S. and Pallant A., (2018), Validation of Automated Scoring for a Formative Assessment that Employs Scientific Argumentation, Educ. Assess., 23, 121–138.
  77. Mason R. A. and Just M. A., (2016), Neural Representations of Physics Concepts, Psychol. Sci., 27, 904–913.
  78. McNeill K. L., Lizotte D. J., Krajcik J. and Marx R. W., (2006), Supporting Students' Construction of Scientific Explanations by Fading Scaffolds in Instructional Materials, J. Learn. Sci., 15, 153–191.
  79. Messick S., (1994), The Interplay of Evidence and Consequences in the Validation of Performance Assessments, Educ. Res., 23, 13–23.
  80. Mislevy R. J., (2006), Cognitive psychology and educational assessment, in Brennan R. L. (ed.), Educational measurement, Phoenix: Greenwood Press, vol. 4, pp. 257–305.
  81. Mislevy R. J., (2016), How Developments in Psychology and Technology Challenge Validity Argumentation, J. Educ. Meas., 53, 265–292.
  82. Mislevy R. J. and Haertel G. D., (2007), Implications of Evidence-Centered Design for Educational Testing, Educ. Meas, 25, 6–20.
  83. Mislevy R. J., Almond R. G. and Lukas J. F., (2003a), A Brief Introduction to Evidence-Centered Design, ETS Res. Rep. Ser., 2003, i–29.
  84. Mislevy R. J., Steinberg L. S. and Almond R. G., (2003b), Focus Article: On the Structure of Educational Assessments, Meas. Interdiscip. Sci Res. Per., 1, 3–62.
  85. Mitchell T. M., (1997), Machine Learning, New York, NY: McGraw Hill.
  86. Mjolsness E. and Decoste D., (2001), Machine Learning for Science: State of the Art and Future Prospects, Science, 293, 2051–2055.
  87. Mohri M., Rostamizadeh A. and Talwalkar A., (2012), Foundation of Machine Learning, Cambridge, MA London, England: The MIT Press.
  88. National Research Council, (2012), A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas, National Academic Press.
  89. Nehm R. H., Ha M. and Mayfield E., (2012), Transforming Biology Assessment with Machine Learning: Automated Scoring of Written Evolutionary Explanations, J. Sci. Educ. Technol., 21, 183–196.
  90. Nelson L. K., (2020), Computational Grounded Theory: A Methodological Framework, Sociol. Methods Res., 49, 3–42.
  91. Nelson L. K., Burk D., Knudsen M. and McCall L., (2021), The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods, Sociol. Methods Res., 50, 202–237.
  92. Newell A. and Simon H. A., (1961), GPS, A Program that Simulates Human Thought, in Billing H. (ed.), Lernende Automaten, München: Oldenbourg, pp. 109–124.
  93. Noroozi O., Kirschner P. A., Biemanns H. J. A. and Mulder M., (2018), Promoting Argumentation Competence: Extending from First- to Second-Order Scaffolding Through Adaptive Fading, Educ. Psychol. Rev., 30, 153–176.
  94. Novak G. M., Gavrin A., Patterson E. and Christian W., (1999), Just-In-Time Teaching: Blending Active Learning with Web Technology, Upper Saddle River NJ: Prentice Hall.
  95. Noyes K., McKay R. L., Neumann M., Haudek K. C. and Cooper M. M., (2020), Developing Computer Resources to Automate Analysis of Students' Explanations of London Dispersion Forces, J. Chem. Educ., 97, 3923–3936.
  96. Noyes K., Carlson C. G., Stoltzfus J. R., Schwarz C. V., Long T. M. and Cooper M. M., (2022), A Deep Look into Designing a Task and Coding Scheme through the Lens of Causal Mechanistic Reasoning, J. Chem. Educ., 99, 874–885.
  97. Pellegrino J. W., (2013), Proficiency in Science: Assessment Challenges and Opportunities, Science, 340, 320–323.
  98. Pellegrino J., DiBello L. and Goldman S., (2016), A Framework for Conceptualizing and Evaluating the Validity of Instructionally Relevant Assessments, Educ. Psychol., 51, 59–81.
  99. Prevost L. B., Haudek K. C., Merrill J. E. and Urban-Lurain M., (2012a), Deciphering student ideas on thermodynamics using computerized lexical analysis of student writing, Presented in part at the ASEE Annual Conference & Exposition, San Antonio, TX.
  100. Prevost L. B., Haudek K. C., Merrill J. E. and Urban-Lurain M., (2012b), Examining student constructed explanations of thermodynamics using lexical analysis, Presented in part at the 2012 IEEE Frontiers in Education Conference, Seattle, WA.
  101. Prevost L. B., Haudek K. C., Cooper M. M. and Urban-Lurain M., (2014), Computerized Lexical Analysis of Students' Written Interpretations of Chemical Representations, Presented in part at the National Association of Research in Science Teaching Annual Conference, Pittsburgh, PA.
  102. Rafferty A. N., Gerard L. F., McElhaney K. W. and Linn M. C., (2013), Automating Guidance for Students' Chemistry Drawings, Presented in part at the Artificial Intelligence in Education Conference, Memphis, TN.
  103. Rafferty A. N., Gerard L. F., McElhaney K. and Linn M. C., (2014), Promoting Student Learning through Automated Formative Guidance on Chemistry Drawings, in Proceedings of the International Society of the Learning Sciences, Boulder, CO, pp. 386–393.
  104. Raker J. R., Yik B. J. and Dood A. J., (2023), Development of a Generalizable Framework for Machine Learning-Based Evaluation of Written Explanations of Reaction Mechanisms from the Postsecondary Organic Chemistry Curriculum, in Graulich N. and Shultz G. V. (ed.), Student Reasoning in Organic Chemistry, The Royal Society of Chemistry, pp. 304–319.
  105. Rich E., Knight K. and Nair S. B., (2009), Artificial Intelligence, McGraw-Hill.
  106. Riconscente M. M., Mislevy R. J. and Corrigan S., (2015), Evidence-Centered Design, in Lane S., Raymond M. R. and Haladyna T. M. (ed.), Handbook of Test Development, New York, NY: Taylor & Francis/Routledge, vol. 2, pp. 40–63.
  107. Rosenberg J. M. and Krist C., (2021), Combining Machine Learning and Qualitative Methods to Elaborate Students’ Ideas About the Generality of their Model-Based Explanations, J. Sci. Educ. Technol., 30, 255–267.
  108. Rupp A. A., Levy R., Dicerbo K. E., Sweet S. J., Crawford A. V., Caliço T., Benson M., Fay D., Kunze K. L., Mislevy R. J. and Behrens J. T., (2012), Putting ECD into Practice: The Interplay of Theory and Data in Evidence Models within a Digital Learning Environment, J. Educ. Data Min., 4, 49–110.
  109. Russ R. S., Scherr R. E., Hammer D. and Mikeska J., (2008), Recognizing Mechanistic Reasoning in Student Scientific Inquiry: A Framework for Discourse Analysis Developed From Philosophy of Science, Sci. Educ., 92, 499–525.
  110. Samuel A. L., (1959), Some Studies in Machine Learning Using the Game of Checkers, IBM J. Res. Dev., 3, 211–229.
  111. Sao Pedro M. A., de Baker R. S. J., Gobert J. D., Montalvo O. and Nakama A., (2013), Leveraging machine-learned detectors of systematic inquiry behavior to estimate and predict transfer of inquiry skill, User Model. User-Adapt. Interact., 23, 1–39,  DOI:10.1007/s11257-011-9101-0.
  112. Sevian H. and Talanquer V., (2014), Rethinking chemistry: a learning progression on chemical thinking, Chem. Educ. Res. Pract., 15, 10–23.
  113. Sherin B., (2013), A Computational Study of Commonsense Science: An Exploration in the Automated Analysis of Clinical Interview Data, J. Learn. Sci., 22, 600–638.
  114. Songer N. B. and Ruiz-Primo M. A., (2012), Assessment and Science Education: Our Essential New Priority? J. Res. Sci. Teach., 49, 683–690.
  115. Southard K., Wince T., Meddleton S. and Bolger M. S., (2016), Features of Knowledge Building in Biology: Understanding Undergraduate Students' Ideas about Molecular Mechanisms, CBE Life Sci. Educ., 15, ar7.
  116. Stowe R. L. and Cooper M. M., (2017), Practicing What We Preach: Assessing “Critical Thinking” in Organic Chemistry, J. Chem. Educ., 94, 1852–1859.
  117. Stowe R. L., Scharlott L. J., Ralph V. R., Becker N. M. and Cooper M. M., (2021), You Are What You Assess: The Case for Emphasizing Chemistry on Chemistry Assessments, J. Chem. Educ., 98, 2490–2495.
  118. Talanquer V., (2009), On Cognitive Constraints and Learning Progressions: The case of “structure of matter”, Int. J. Sci. Educ., 31, 2123–2136.
  119. Tansomboon C., Gerard L. F., Vitale J. M. and Linn M. C., (2017), Designing Automated Guidance to Promote Productive Revision of Science Explanations, Int. J. Artif. Intell. Educ., 27, 729–757.
  120. Urban-Lurain M., Moscarella R. A., Haudek K. C., Giese E., Sibley D. F. and Merrill J. E., (2009), Beyond Multiple Choice Exams: Using Computerized Lexical Analysis to Understand Students' Conceptual Reasoning in STEM Disciplines, Presented in part at the 2009 IEEE Frontiers in Education Conference, San Antonio, TX.
  121. Urban-Lurain M., Moscarella R. A., Haudek K. C., Giese E., Merrill J. E. and Sibley D., (2010), Insight into Student Thinking in STEM: Lessons Learned from Lexical Analysis of Student Writing, Presented in part at the National Association of Research in Science Teaching Annual Conference, Philadelphia, PA.
  122. Urban-Lurain M., Prevost L., Haudek K. C., Henry E. N., Berry M. and Merrill J. E., (2013), Using Computerized Lexical Analysis of Student Writing to Support Just-in-Time Teaching in Large Enrollment STEM Courses, Presented in part at the 2013 IEEE Frontiers in Education Conference, Oklahoma City, OK.
  123. van Mil M. H. W., Postma P. A., Boerwinkel D. J., Klaasen K. and Waarlo A. J., (2016), Molecular Mechanistic Reasoning: Toward Bridging the Gap Between the Molecular and Cellular Levels in Life Science Education, Sci. Educ., 100, 517–585.
  124. Vitale J. M., Lai K. and Linn M. C., (2015), Taking advantage of automated assessment of student-constructed graphs in science, J. Res. Sci. Teach., 52, 1426–1450.
  125. Vitale J. M., McBride E. and Linn M. C., (2016), Distinguishing complex ideas about climate change: knowledge integration vs. specific guidance, Int. J. Sci. Educ., 38, 1548–1569.
  126. Wang C., Liu X., Wang L., Sun Y. and Zhang H., (2021), Automated Scoring of Chinese Grades 7–9 Students' Competence in Interpreting and Arguing from Evidence, J. Sci. Educ. Technol., 30, 269–282.
  127. Watts F. M., Zaimi I., Kranz D., Graulich N. and Shultz G. V., (2021), Investigating students’ reasoning over time for case comparisons of acyl transfer reaction mechanisms, Chem. Educ. Res. Pract., 22, 364–381.
  128. Watts F. M., Dood A. J. and Shultz G. V., (2023), Developing machine learning models for automated analysis of organic chemistry students' written descriptions of organic reaction mechanisms, in Graulich N. and Shultz G. V. (ed.), Student Reasoning in Organic Chemistry, The Royal Society of Chemistry, pp. 285–303.
  129. Watts F. M., Park G. Y., Petterson M. N. and Shultz G. V., (2022), Considering alternative reaction mechanisms: students’ use of multiple representations to reason about mechanisms for a writing-to-learn assignment, Chem. Educ. Res. Pract., 23, 486–507.
  130. Wehbe L., Murphy B., Talukdar P., Fyshe A., Ramdas A. and Mitchell T. M., (2014), Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses, PLoS One, 9, e112575.
  131. Williamson D. M., Xi X. and Breyer F. J., (2012), A Framework for Evaluation and Use of Automated Scoring, Educ. Meas.: Issues Pract., 31, 2–13.
  132. Winograd B. A., Dood A. J., Finkenstaedt-Quinn S. A., Gere A. R. and Shultz G. V., (2021a), Automating Characterization of Peer Review Comments in Chemistry Courses, in Proceedings of the 14th Computer-Supported Collaborative Learning (CSCL), Bochum, Germany, pp. 11–18.
  133. Winograd B. A., Dood A. J., Moon A., Moeller R., Shultz G. V. and Gere A. R., (2021b), Detecting High Orders of Cognitive Complexity in Students' Reasoning in Argumentative Writing About Ocean Acidification, in Proceedings of the LAK21: 11th International Learning Analytics and Knowledge Conference, New York, NY, pp. 586–591.
  134. Wood D., Bruner J. S. and Ross G., (1976), The role of tutoring in problem solving, J. Child Psychol. Psychiatry, 17, 89–100.
  135. Wulff P., Buschhüter D., Westphal A., Mientus L., Nowak A. and Borowski A., (2022a), Bridging the Gap Between Qualitative and Quantitative Assessment in Science Education Research with Machine Learning—A Case for Pretrained Language Models-Based Clustering, J. Sci. Educ. Technol., 31, 490–513.
  136. Wulff P., Mientus L., Nowak A. and Borowski A., (2022b), Utilizing a Pretrained Language Model (BERT) to Classify Preservice Physics Teachers’ Written Reflections, Int. J. Artif. Intell. Educ., 1–28,  DOI:10.1007/s40593-022-00290-6.
  137. Yik B. J., Dood A. J., Cruz-Ramírez de Arellano D., Fields K. B. and Raker J. R., (2021), Development of a machine learning-based tool to evaluate correct Lewis acid-base model use in written responses to open-ended formative assessment items, Chem. Educ. Res. Pract., 22, 866–885.
  138. Yik B. J., Dood A. J., Frost S. J. H., Cruz-Ramírez de Arellano D., Fields K. B. and Raker J. R., (2023), Generalized rubric for level of explanation sophistication for nucleophiles in organic chemistry reaction mechanisms, Chem. Educ. Res. Pract., 24, 263–282.
  139. Zhai X., (2019), Call for Papers: Applying Machine Learning in Science Assessment: Opportunity and Challenge, J. Sci. Educ. Technol., 1–3,  DOI:10.13140/RG.2.2.10914.07365.
  140. Zhai X., (2021), Practices and Theories: How Can Machine Learning Assist in Innovative Assessment Practices in Science Education, J. Sci. Educ. Technol., 30, 139–149.
  141. Zhai X., Haudek K. C., Shi L., Nehm R. H. and Urban-Lurain M., (2020a), From substitution to redefinition: a framework of machine learning-based science assessment, J. Res. Sci. Teach., 57, 1430–1459.
  142. Zhai X., Haudek K. C., Stuhlsatz M. A. M. and Wilson C. D., (2020b), Evaluation of Construct-Irrelevant Variance Yielded by Machine and Human Scoring of a Science Teacher PCK Constructed Response Assessment, Stud. Educ. Eval., 67, 100916.
  143. Zhai X., Yin Y., Pellegrino J. W., Haudek K. C. and Shi L., (2020c), Applying machine learning in science assessment: a systematic review, Stud. Sci. Educ., 56, 111–151.
  144. Zhai X., Haudek K. C. and Ma W., (2022a), Assessing Argumentation Using Machine Learning and Cognitive Diagnostic Modeling, Res. Sci. Educ., 1–20,  DOI:10.1007/s11165-022-10062-w.
  145. Zhai X., He P. and Krajcik J., (2022b), Applying machine learning to automatically assess scientific models, J. Res. Sci. Teach., 1–30,  DOI:10.1002/tea.21773.
  146. Zhai X., Shi L. and Nehm R. H., (2021), A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements, J. Sci. Educ. Technol., 30, 361–379.
  147. Zhu M., Lee H.-S., Wang T., Liu O. L., Belur V. and Pallant A., (2017), Investigating the impact of automated feedback on students' scientific argumentation, Int. J. Sci. Educ., 39, 1648–1668.