“I guess it was more than just my general knowledge of chemistry”: exploring students’ confidence judgments in two-tiered assessments

Casandra Koevoets-Beach, Karen Julian and Morgan Balabanoff*
Department of Chemistry, University of Louisville, Louisville, KY, USA. E-mail: morgan.balabanoff@louisville.edu

Received 1st June 2023, Accepted 3rd August 2023

First published on 15th August 2023


Abstract

Two-tiered assessment structures with paired content and confidence items are frequently used within chemistry assessments to stimulate and measure students’ metacognition. The confidence judgment is designed to promote students’ reflection on their application of content knowledge and can be characterized as calibrated or miscalibrated based on its accuracy. Previous studies often attributed students’ miscalibrated confidence rankings to metaignorance; however, in this qualitative study, interviews with general chemistry students were thematically analysed to provide a more robust understanding of the processes and factors students use when engaging with these metacognitive prompts in a chemistry assessment. Both calibrated and miscalibrated confidence judgments were observed, independent of answer accuracy. Students who provided miscalibrated confidence judgments often used unreliable metrics such as processing fluency, which can mimic content mastery, whereas students who provided more accurate evaluations of their confidence relied more heavily on their stable understanding of chemistry concepts. Many students cited previous experiences, underlying self-efficacy beliefs, and/or the use of test-taking strategies which negatively or positively impacted their confidence. These findings suggest that the confidence tier is indeed capturing students’ self-assessment; however, students’ confidence judgments are based on a range of factors independent of content knowledge, which may limit the utility of this metacognitive tool for students, researchers, and instructors.


Introduction

Metacognition is often referred to as “thinking about thinking”, or the higher-order level of thinking which involves active control over the cognitive processes engaged in learning and assessing whether a cognitive goal has been met (Flavell, 1979; Livingston, 2003). The ability to apply and problem solve using chemistry concepts requires the conscious interrelation of various chemical phenomena (Mahaffy, 2004), which has directed interest towards interventions and assessments targeting metacognitive skills in chemistry learners. One such tool which has become commonly used in chemistry education research (CER) is the confidence judgment as a secondary tier in assessments evaluating conceptual chemistry knowledge.

In these two-tier assessments, confidence judgments have been paired with content questions and used to investigate the relationship between students’ chemistry content knowledge and their confidence in that knowledge. In addition to investigating this relationship, the confidence tier has also served as a way for students to practice self-assessment and strengthen metacognitive skill by providing an opportunity to reflect on their conceptual understanding. Having a strong grasp of their level of conceptual understanding is a critical skill for students as they continually encounter new topics over the course of their STEM education (NGSS Lead States, 2013).

While the confidence tier engages students in this important practice and the relationship between confidence and content knowledge has been used to make claims about students’ ability to self-assess, the range of factors students use to evaluate their confidence has yet to be investigated in the context of chemistry assessments. Without a robust understanding of how students are engaging in the confidence tier, there may be limitations on the claims that can be made regarding the relationship between students’ chemistry knowledge and confidence.

In the present study, semi-structured interviews allowed students to describe their thought processes and the factors which impacted their confidence rankings. This work aims to elicit the variation in students’ decision-making processes when evaluating confidence in their own content knowledge. Equipped with a more complete understanding of the factors that students are using in these contexts, instructors can more ably promote reflective learning and provide targeted support. This work seeks to underpin improved measurement of the construct of confidence so that claims can be better understood and utilized by researchers.

Background

Metacognition

Metacognition, or “thinking about thinking”, consists of two major components: metacognitive knowledge, i.e., what a person knows about their own thinking processes, and metacognitive regulation, i.e., strategies used to control one's thinking and learning (Jacobs and Paris, 1987). Improving metacognitive regulation is often the target of interventions, as it encompasses the skills of planning, monitoring and evaluation (Schraw and Moshman, 1995) which are crucial for effective problem solving (Davidson et al., 1994).

Students’ metacognitive skill in monitoring their learning processes has been shown to improve effective regulation of study and assessment performance (Thiede et al., 2003). Other studies have shown that learning outcomes improve when students’ metacognition is strengthened through in-class interventions (Thiede et al., 2011; Lavi et al., 2019) and that high performance on problem-solving tasks is more closely related to performance on metacognitive measures than to aptitude measures (Swanson, 1990).

Growing literature in science education has strengthened claims that improved metacognition can enhance science literacy, teaching, and learning (Adey et al., 1989; Davis, 1996; Blank, 2000; Georghiades, 2000, 2004). The expectation for students to consciously interrelate chemical phenomena to solve problems in chemistry courses (Mahaffy, 2004) has prompted research on interventions and assessments targeting metacognitive skills in chemistry learners specifically. Studies of undergraduate chemistry students have found that explicit metacognitive regulation can improve learning performance in chemistry courses (Cook et al., 2013; Casselman and Atwood, 2017; Dori et al., 2018). However, implicit metacognitive monitoring alone was not found to improve chemistry students’ learning performance (Hawker et al., 2016), reinforcing the need to explicitly prompt metacognitive regulation.

Students engaging in metacognition are expected to reflect on the techniques and processes employed to engage in problem solving and make final judgments. Heuristics are frequently used tools which students employ to reduce their information-processing load during problem solving (Gigerenzer and Todd, 1999; Gilovich et al., 2002). Specific heuristics, or forms of intuitive reasoning, relevant to chemistry learners have been cited as particularly threatening to the deep, conceptual chemical thinking that is essential in chemistry education (Talanquer, 2014).

An affect heuristic is used to make judgments and decisions based on feelings evoked from the information at hand (Finucane et al., 2000). Evaluation of chemistry students’ conceptual understanding and the factors students use to self-evaluate have also been shown to correlate with affective variables including self-concept and situational interest (Nieswandt, 2007). The affect heuristic is operationalized as a decision-making tool which impacts choices made by college students regarding synthetic substances and chemical processes (Rozin, 2005). Problem-solving and reasoning in chemistry is often hindered by the use of this type of heuristic (Talanquer, 2014).

Another heuristic often used is processing fluency, a metacognitive experience determined by the relative ease or difficulty demanded by a cognitive process. Use of this mental shortcut often leads students to choose responses based on features which are processed fastest (i.e., most fluently) (Heckler and Scaife, 2015).

The use of test-taking strategies independent of content knowledge in chemistry problem-solving can result in applications of a cognitive rigidity heuristic. This heuristic, which is often applied by novice learners who utilize rigid problem-solving algorithms, leads students to fall back on processes or solutions that have worked for them in the past without recognizing how to use the presented information (Gabel and Bunce, 1994).

Explicit metacognitive regulation through meaningful engagement with students in their chemistry learning environments may mitigate the effects of these heuristics on chemistry students’ problem-solving processes by prompting the development of more productive chemical thinking (Talanquer, 2014).

Calibration

Students utilize an arsenal of heuristics, content knowledge, and other tools to problem solve and make self-judgments, so when asked to engage in metacognition, they are asked to reflect on how accurately they employed their tools to choose a response. The agreement between this self-judgment and students’ actual performance in the problem-solving process is referred to as calibration. Calibration measures the degree to which one's perception of performance corresponds to actual performance (Yates, 1990; Keren, 1991; Nietfeld et al., 2006). The degree of calibration between students’ actual and perceived performance can therefore be considered evidence of metacognitive skill and consequently be used to regulate learning behaviours. For example, students who can accurately assess the extent of their knowledge should be better positioned to intensify or redirect their study focus or generate self-feedback (Hacker et al., 2008). Alternatively, students who are not able to align their metacognitive reflection to their ability may not be able to regulate their self-feedback to promote productive study or learning behaviours.

In assessment development, calibration or miscalibration of conceptual knowledge is often measured through self-assessments of individuals’ accuracy in cognitive domains by prompting students to assess the correctness of their responses (Pallier et al., 2002). The agreement between self-assessments and actual measured accuracy is then utilized to make claims regarding calibration and metacognitive skill.

Metacognitive self-assessment outcome scores may take different forms, from subjective probabilities via rating confidence on a 1–100 scale, to dichotomous predictions of whether a performance was successful or unsuccessful, to a Likert-style scale coded to reflect degrees of confidence (Schraw, 2009). Quantitatively, these outcome scores have been used to measure the goodness of fit of these judgments, which may be done through measurement of accuracy, scatter, discrimination, and/or bias scores (Yates, 1990; Keren, 1991; Nelson, 1996; Allwood et al., 2005; Burson et al., 2006).

Each measure provides a different type of information; the measures complement one another, and none is well suited to all situations (Schraw, 2009). For example, bias scores reflect the correspondence between personal assessment of accuracy and an empirical result, where a positive bias score represents overconfidence, a negative bias score represents underconfidence, and a bias score of zero indicates accurate assessment (Pallier et al., 2002). Investigators have attempted to explain such over- and underconfidence phenomena through two prominent approaches: the heuristics and biases approach, and the ecological approach.
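For reference, a minimal formulation of the bias score of the kind described in this literature (the symbols below are illustrative and are not tied to the scoring of any one cited instrument) averages the signed difference between a learner's item-level confidence judgment $c_i$ and their scored performance $p_i$ over $n$ items:

$$\text{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(c_i - p_i\right), \qquad c_i, p_i \in [0, 1]$$

Under this formulation, a positive value indicates overconfidence, a negative value indicates underconfidence, and a value of zero indicates accurate self-assessment.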

The heuristics and biases approach claims that confidence miscalibration occurs due to general cognitive biases, heuristics, or a combination of both, which facilitate intuitive judgments (Tversky and Kahneman, 1996). The ecological approach, on the other hand, suggests that the cues used to solve cognitive problems are provided by environmental knowledge (Gigerenzer, 1991). This approach posits that both the individual's response and confidence in that response are generated by the same environmental cue (Gigerenzer, 1991). These two approaches may be interpreted as attributing miscalibration to either environmental or personal influences; however, the reciprocal interaction of these factors with behavioural influences, as outlined by social cognitive theory (SCT) (Fig. 1), provides a more robust and holistic explanation for students’ metacognitive calibration (Bandura, 1986).


Fig. 1 Reciprocal model of influences which impact learning outcomes as outlined by Bandura's social cognitive theory (SCT).

Miscalibration between accuracy and self-judgment is a well-documented psychological occurrence termed the Dunning–Kruger effect, which characterizes the phenomenon of illusory competence, or metaignorance: a cognitive bias in which individuals’ perceived ability exceeds their actual ability (Kruger and Dunning, 1999). This example of miscalibration has been recognized across disciplines and identified as a robust phenomenon in introductory chemistry courses (Bell and Volckmann, 2011; Pazicni and Bauer, 2014). In introductory chemistry, students with mistakenly high self-ratings were resistant to traditional forms of feedback, i.e., low exam scores, and were likely unaware of the need to take corrective steps to improve their performance (Pazicni and Bauer, 2014). Strengthening students’ metacognitive skills by providing opportunities to practice self-assessment has been shown to have a positive effect on student performance calibration (Nietfeld et al., 2005).

Assessment of metacognition

To assess metacognition, developers can choose which metacognitive measurement best evaluates their target construct. Of note, either self-efficacy or confidence judgments can be utilized to target metacognition in assessment settings; however, confidence judgments are more frequently observed in the CER literature. The difference between their targets is nuanced and worth considering in the present work.

Framed by SCT, self-efficacy (SE) refers to “beliefs in one's capabilities to organize and execute the courses of action required to produce given attainments” (Bandura, 1977). SE can affect students’ motivational outcomes (Bandura, 1997), as learners who feel efficacious about learning are likely to engage in cognitive and behavioural activities that improve their learning. SE judgments are not, however, generalized feelings of success, but rather assess the level, generality, and strength of one's ability to act in pursuit of a designated goal. SE judgments can be measured using questionnaire items that are task specific, vary in difficulty, and capture degrees of confidence (Zimmerman, 2000). For example, students’ SE on a set of twenty cognitive puzzles could be assessed via the prompt: “Judge how many items of the previous task that you think you were capable of solving” (Cervone and Peake, 1986) which encompasses all three of Bandura's requirements.

Alternatively, confidence judgments are prompts which specifically target an individual's accuracy in a cognitive domain and are often presented as subjective probabilities of a successful outcome (Zimmerman et al., 1977). Where SE judgments more broadly encompass one's belief in their ability to act in pursuit of successfully completing a task, confidence judgments more closely focus on accuracy within a single domain. These question types both present opportunities for students to engage in targeted metacognitive reflection and may be used in two-tier assessments for chemistry students; the distinction lies merely in the focus of the self-assessment.

Two-tier assessments initially took shape as diagnostic instruments utilizing two-tier multiple-choice items to identify students’ knowledge of a scientific concept in tier 1 and their reasoning for that concept in tier 2 (Treagust, 1986). These assessment items have been presented as an alternative to more arduous qualitative processes to determine students’ understanding and identify alternate conceptions in limited and clearly defined ways (Treagust, 1988, 1995). The two-tier assessment format was adapted to include metacognitive tools with the certainty of response index (CRI), developed to gauge individuals’ certainty in their response as a paired, follow-up question. CRI was developed in light of growing literature in the social sciences which reported greater effectiveness of immediate vs. delayed feedback on skill performance (Webb et al., 1994). CRI was utilized by Hasan et al. in a physics education assessment development study which sought to identify strongly held alternate cognitive structures regarding classical mechanics (Hasan et al., 1999). The “Hasan hypothesis” posits that high-certainty incorrect responses may identify students with strongly held alternate conceptions, and that low-certainty correct responses may identify guessing on assessment items (Hasan et al., 1999).

As more robust metacognitive judgment literature emerged and the construct of response confidence was clearly defined (Pallier et al., 2002; Allwood et al., 2005; Burson et al., 2006; Schraw, 2009), the Dunning–Kruger “metaignorance” phenomenon was identified and broadly documented across groups (Kruger and Dunning, 1999). In response to the NGSS call for improvement of metacognitive skill in science learners (NGSS Lead States, 2013) and to work which identified a significant presence of the Dunning–Kruger phenomenon in introductory chemistry students (Pazicni and Bauer, 2014), two-tier assessment formats pairing confidence judgments with content questions have been increasingly used by researchers who seek to target chemistry learners’ metacognition.

Assessments and concept inventories targeting representational competence (Connor et al., 2021); fundamental pH concepts (Watson et al., 2020); redox reactions (Brandriet and Bretz, 2014); reaction coordinate diagrams (Atkinson et al., 2020); as well as enthalpy and entropy (Abell and Bretz, 2019) all utilize two-tier confidence judgments. These confidence/calibration data have been used to distinguish clusters of students whose data are similar to one another and dissimilar to those in other clusters (Han et al., 2012), which allows developers to group students based on calibration categories.

Improvement of student self-efficacy has been identified as a target to promote well-rounded and conceptualized chemistry knowledge (Dalgety and Coll, 2006; Kan and Akbaş, 2006; Villafañe et al., 2014; Ferrell and Barbera, 2015; Avargil, 2019); however, two-tier assessments utilizing confidence tiers have largely been used as tools to identify metaignorance in chemistry learners who may benefit from interventions (McClary and Bretz, 2012; Brandriet and Bretz, 2014; Connor et al., 2021). The confidence judgment tool provides a measure of chemistry learners’ calibration of accuracy to confidence, delivering feedback that instructors can use to identify whether students are calibrated. It also allows for targeted instruction and provides students the opportunity to practice the skill of metacognitive monitoring and engage in comparative benchmarking with peers (Farh and Dobbins, 1989; Clinchot et al., 2017).

Research questions

The growing body of assessment literature in chemistry education research has emphasized the importance of presenting evidence that newly developed instruments provide valid and reliable measurements of specific constructs through increasingly rigorous psychometric analysis (Barbera and VandenPlas, 2011; Heredia and Lewis, 2012). Evidence is provided by researchers to make claims regarding content and construct validity of chemistry content assessment items, and to assert a test's consistent performance across groups. Often, interview evidence is presented within target populations to evaluate items and/or instruments for response process validity (Arjoon et al., 2013; Wren and Barbera, 2013). In the case of confidence rankings used in two-tiered chemistry assessments, however, it is unclear how students are interpreting the confidence tier. Further exploration is needed targeting the complex reasoning students use to select a confidence response and the array of factors students use to engage in metacognition. Data collection via response process validity interviews and the subsequent analysis in this project were therefore guided by the following research questions:

RQ1. Which strategies, factors, and thought processes do students describe while ranking their confidence on a General Chemistry assessment?

RQ2. What relationships can be observed between correctness of students’ answers and reasoning for their confidence judgments?

Theoretical framework

This research was framed using Bandura's SCT, which posits that learning is affected by reciprocal interactions between an individual's behaviours, their internal personal factors (i.e., cognitions and emotions), and external environmental events (Bandura, 1986, 1997). Each set of influences (behavioural, personal, and environmental) affects the other and is in turn affected by them (Fig. 1). In this model, motivational processes such as self-efficacy and self-comparisons are categorized as types of personal influences, feedback and graded assessments from instructors are categorized as types of environmental influences, and activities which promote learning such as attending lectures and expending study efforts are categorized as types of behavioural influences (Schunk and DiBenedetto, 2020). Bandura's social cognitive theory included the concept of self-regulated learning (SRL), which was applied to classroom learning, among other settings. SRL consists of three main components: cognition, metacognition, and motivation. The current study focuses on the facet of metacognition as a target to improve SRL in chemistry students.

Methods

The study design and participant recruitment were approved by the Institutional Review Board at the University of Louisville (IRB#: 22.0454) and at the University of Nebraska-Lincoln (IRB#: 22.0118).

Participants

A convenience sampling approach was used for this study within two large, public, research-intensive universities. The sample pool consisted of students who had completed the two-semester sequence of General Chemistry and completed the Water Instrument assessment for General Chemistry (Balabanoff et al., 2022). Interview participants were recruited at the end of the assessment and were compensated for their time with a gift card. Over six semesters, 1638 students completed the assessment and 22 students participated in individual cognitive interviews—16 from Institution A, and 6 from Institution B. Preliminary interviews were conducted in Fall 2019 and subsequent interviews took place in Spring 2020, Spring 2021, Spring 2022, and Fall 2022 semesters. Each participant was assigned a pseudonym and identifying information was removed from the data to protect participants’ identities.

Data collection

Preliminary interviews were collected in Fall 2019 as part of an assessment development study investigating validity and reliability of a general chemistry assessment which utilized a two-tier question format (Balabanoff et al., 2022). Response process validity (RPV) interviews were carried out to collect novice content validity evidence for assessment items. These preliminary interviews targeted response process validity for content items based on Tourangeau's four-stage cognitive model (Tourangeau, 1984; Peterson et al., 2017). During interviews, students were asked to explain their thought process as they answered selected items they previously had seen on the assessment. Some participants independently reflected on their confidence in their thinking processes during these interviews, potentially due to their exposure to the confidence tier in the assessment setting.

Beginning in Spring 2020 and for all subsequent interviews, the RPV interview protocol for the research team was expanded to explicitly include probes exploring students’ response processes for their confidence judgments. Similar to the preliminary interviews, students were shown items from the assessment which served as a frame of reference for confidence judgments. For this study, the data for both content and confidence response processes were analysed, and relevant data from preliminary interviewees who discussed factors impacting their confidence without explicit prompting have been included.

The interview protocol utilized from Spring 2020 forward included questions targeting responses to content items as well as additional interview items directed at participants’ perspectives and reasons for their confidence rankings. The interviews were semi-structured and designed to prompt students to describe the techniques they used to answer questions. Students were presented with a content question first and asked to describe the process they used to answer. They were then presented with the Likert scale-ranked confidence tier and asked to explain their chosen confidence ranking (Fig. 2). Prompts in the interview protocol asked students to expand on, describe, or connect strands presented between both their content and confidence response processes. Students repeated this process for 3–6 items, depending on the length of the problems, to keep interview lengths similar.


Fig. 2 Example of two-tier item with paired content and confidence.

Interviews were audio- and video-recorded, transcribed using Otter.ai, and cleaned by the research team to confirm accuracy.

Data analysis

The interview data were evaluated using MAXQDA qualitative analysis software. Each content item was coded dichotomously as correct or incorrect, and the selected confidence tier option was coded as high, average, or low confidence. The five Likert-scale options were nested into three categories to simplify classification of confidence responses: the confidence ranking choices of “very well” and “well” were combined into High Confidence, and the confidence levels of “poorly” and “very poorly” were combined into Low Confidence. Subsequent open coding explored students’ reasoning for their confidence rankings, looking for recurring patterns and themes that could be expanded into distinct code categories describing observed student behaviours.
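As a minimal illustration of this nesting step (assuming the response labels quoted above; this is a sketch rather than the research team's actual MAXQDA workflow, and the label for the middle category is assumed), the mapping can be expressed as:

```python
# Illustrative sketch only: collapse the five confidence-tier Likert options
# into the three nested categories described in the text.
LIKERT_TO_CATEGORY = {
    "very well": "High Confidence",
    "well": "High Confidence",
    "average": "Average Confidence",  # middle category label assumed
    "poorly": "Low Confidence",
    "very poorly": "Low Confidence",
}

def collapse_confidence(response: str) -> str:
    """Return the nested confidence category for a raw Likert response."""
    return LIKERT_TO_CATEGORY[response.strip().lower()]

# Example: collapse_confidence("Very well") returns "High Confidence"
```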

The qualitative codebook was structured to incorporate evolving codes as a result of the research team's recognition of new patterns and themes during data analysis (Miles et al., 2014). This resulted in segments being adjusted and recoded. As central themes were identified by the researchers, codes were grouped and recategorized utilizing constant comparative techniques (Glaser, 1965). Knowledge-related codes, confidence-related codes, and question-related codes were categories that emerged during this open coding process (Fig. 3).


Fig. 3 Evolution of coding scheme from open coding to final code book.

Codes and categories which emerged from open coding were then further refined during the subsequent axial coding process. The impact of increasingly frequent codes on confidence was assessed, which led to the alteration of certain codes for clarity. An example of this was the separation of the Confidence category into Self Reflection and Time Dependence, to distinguish between instances in which students described their level of confidence or ability to respond in terms of their own judgments or experiences and instances in which they described their processing fluency.

Connections and relationships observed between various codes in one transcript were documented and evaluated by analysing their recurrence in the other transcripts. These relationships were frequently identified between codes pertaining to confidence level and correctness and codes associated with test-taking strategies and/or heuristics. This process was streamlined through use of MAXQDA's Complex Coding Query feature.

The final step of the coding process involved defining overarching themes which were used to categorize the set of codes from the previous axial coding process. These emergent themes focused on content knowledge, self-reflection, time dependence, and test-taking strategies.

Throughout the process, the first and second authors independently coded the same transcripts before meeting to discuss and refine until agreement was reached. Intermittent check-ins with the third author to confirm and clarify emerging codes allowed for consensus to be reached by the research team. Each final code in the code book was confirmed to be present in at least two transcripts at each institution. The lowest frequency code observed was “Peer Comparison”, observed in transcripts of four students (two at each institution), and the highest frequency code was “Correctly Applied Conceptions”, present in all twenty-two interviews analysed. The final code book was agreed upon by the entire research team and any transcripts utilizing old codes were re-coded by the first and second authors to align with the final code book and revised until agreement among all authors was reached (Table 1).

Table 1 Final thematic code book derived through constant comparative qualitative analysis of interview data
Code: Description
Content knowledge
Correctly applied conceptions: Correctly applied a learned concept which was well-conceptualized
Incorrectly applied conceptions: Utilized and described an alternate, incorrect conception
Self-identified gaps: Describes a concept that they believe they should understand or know but do not
Personal influences
Past experiences: Description of a past experience when describing confidence
Self-judgment: Provides an overarching, big picture evaluation of own abilities
Comparing confidence across items: Compared confidence on a specific item to confidence on previous item(s) in the assessment
Peer comparison: Compares own performance to perception of their classmates’ abilities
Time dependence
Retrieval ease: Describes their own efficient recall of information to answer a question
Retrieval difficulty: Describes longer time needed to retrieve information to be able to answer
Test-taking strategies
Identifying key words: Identified key words for easily activated concepts and/or other relevant information
Elimination: Describes process of eliminating options to select a final answer
Guessing: Mentions a selection choice based on a guess rather than content


Results and discussion

The goals for analysing interview data were to gain insight into how students metacognitively reflect on their knowledge of general chemistry assessment items and determine how this metacognitive assessment tool is being interpreted by students. The confidence tier targets students’ subjective judgment of success in their application of content knowledge to answer content questions, which narrows the focus for these self-assessments. The target construct for the confidence judgments can therefore be defined as students’ identification and metacognitive reflection on their use of general chemistry content knowledge.

Analysis resulted in the characterization of a broad scope of factors used by students to make metacognitive confidence judgments for chemistry assessment items. Students’ application of content knowledge was indeed identified as a factor used to rank confidence; however, extensive use of affective judgments, time-based judgments, and deployment of test-taking strategies were also observed in interviews by the research team. Selected student quotes that exemplify emergent themes are discussed across each research question.

Factors impacting confidence (RQ1)

Content knowledge.
Correctly applied conceptions. During interviews, students demonstrated significant awareness of the impact that content knowledge mastery had on their confidence judgments. Students could correctly apply a learned concept that was well-conceptualized and familiar to answer a content question and reflect on their understanding. One example is when Ted engaged with an item that asked him to identify molecular-level properties of water that remain consistent across different phases.

“I would say I can answer this one well… My ranking is more so based on the concept that the question was asking me about this time. Phase changes have been covered since seventh-grade chemistry, I think it's been hammered. It got a bit specific talking about the bond lengths and density, temperature, but that's all pretty familiar to me.”

(Ted)

Ted's high confidence ranking was attributed to a strong conceptual understanding of chemistry content which he developed over time because of repeat exposures. He explicitly stated that his confidence ranking for this question was “based on the concept that the question was asking”, indicating that he considered how well he understood the target concept when reflecting on his accuracy. Ted's application of his conceptual understanding to answer this specific question led to a higher confidence judgment.

Despite many observable examples of students using content knowledge as a measure of confidence, the ability to correctly apply general chemistry concepts did not consistently correspond to high confidence judgments. When students answered correctly but did not describe content knowledge as impacting their confidence, their rankings were consistently average or low. Factors such as requiring extended time to answer and/or depending on test-taking strategies often led students to express doubt and ultimately rank their confidence as average or low despite answering correctly. For example, Sharon correctly applied chemistry concepts to eliminate multiple response options designed as distractors; however, she ranked her confidence as average. Here, Sharon considered dispersion forces:

Dispersion forces, I know that those are typically found in basically all molecules and compounds. So, they’re not permanent. I crossed out everything that said the word ‘permanent’ in here because dispersions are temporary. They’re not the same as induced dipole which changes based on the orientation of the molecules, I think.

(Sharon)

After discussing the question content, Sharon went on to judge her confidence as average, explaining:

I remember the big topic things. Whether I remember which definition goes to which, that's what's kind of up in the air. That's why I put average, because I might have gotten this right and I probably know what I'm talking about, but maybe I just picked the wrong one.

(Sharon)

Sharon's doubt regarding how well she remembered a specific chemistry concept was linked to her use of elimination strategies, which lengthened her time to decide on an answer. When explaining the reasons for her confidence judgment, Sharon did not self-identify the working understanding of intermolecular forces which provided the foundation for her response. In this example, Sharon's metacognitive confidence judgment based on her knowledge was hindered by self-doubt and other interfering factors.


Incorrectly applied conceptions. Not all students were able to correctly apply chemistry knowledge in interviews. Some utilized and described incorrect conceptions which corresponded to answering content questions incorrectly. Sometimes these students were even able to identify correct concepts in more than one multiple-choice option but still ultimately chose an answer which contained an alluring alternate conception. For example, Gretchen described how she considered two possible explanations of why bonds form in a single water molecule, where Option B stated that “H and O attract because oxygen needs to fulfil its octet.”

I think it's [Option] B, probably. The reason that things form bonds is because they want to satisfy their octet, to be more stable […] The other choices were not as correct to me, like [Option] C says, ‘H and O attract because the valence electrons of each atom are attracted to the nucleus of the other’. I know that's true, but I feel like B is more true because that's why they want to bond.

(Gretchen)

Gretchen recognized that the correct response (Option C) offered a valid explanation, demonstrating a degree of content knowledge, but misattributed the octet rule as the driving reason for bond formation. She went on to make a confidence judgment which reflected her struggle to fully reject either Option B or C:

I guess average, but I’m still a little confused. I don't know. If I reviewed it before taking the test, I would probably be better at answering this. I generally know the concept and would be comfortable explaining it to someone, but I could also imagine saying ‘Oh, I’m not sure, let me check the book.’”

(Gretchen)

Alternate conceptions like the previous example, in which students attribute bond formation to the teleological explanation of atoms “wanting to fulfil an octet”, are often held due to instructors’ use of simplified language when introducing concepts for the first time (Bodner, 1986; Talanquer, 2014). These conceptions can be strongly held, resistant to change, and often persist over time (Bodner, 1991), which makes them even more difficult to metacognitively assess for those with lower metacognitive skill.

A deficit in metacognitive skill was observed in students who confidently applied such alternate conceptions to assessment items. Students were able to construct thorough arguments for incorrect response options using alternate conceptions which often resulted in more confident judgments. Gloria explained her thought process when she was asked to choose which free energy diagram most appropriately described the formation of water from its constituent elements.

In the [chemical] equation there's hydrogen and oxygen separately and then they come together so that would be a decrease in entropy. Then we just have to think decreasing entropy means nonspontaneous reaction and then nonspontaneous means the ΔG is going to be greater than zero which would mean it's an endothermic reaction. Which means products will be higher energy than reactants. That's how I picked C.

(Gloria)

This explanation demonstrates another common alternate conception frequently held by students: that a decrease in entropy indicates that a reaction is non-spontaneous under all conditions (Teichert and Stacy, 2002). Gloria went on to rank her confidence as high based on her perception that she had pieced multiple ideas together.

I feel pretty confident in my answer being right. I just think about how much information I could remember, what the teacher said about it, or that I wrote it down in my notes to decide if I felt confident about my answer or not.

(Gloria)

Alternate conceptions regarding the relationships between chemical bonding, spontaneity, and entropy have been shown to relate to confusion and an unwillingness to reconcile contradictory information in chemistry students (Teichert and Stacy, 2002). Ultimately, Gloria integrated an alternate conception seamlessly into her train of thought, which resulted in a metacognitive judgment of high confidence.


Self-identified gaps. While the use of content knowledge and alternate conceptions was present in interviews, so too were examples of students’ ability to identify gaps in their content knowledge which prevented them from answering correctly. This was observed when students were explicitly able to define their own limitations and consequently ranked their confidence lower. Here, Lily was unable to correctly apply the provided equilibrium constant, which resulted in a low confidence ranking:

I feel like if I were to know what I could use that [constant] for it would have been helpful for sure, because when I take chemistry exams or in my homework if I was given a constant, I always felt a lot better.

(Lily)

Lily's low confidence in her incorrect response indicated a degree of metacognitive skill. While Lily was unable to correctly apply content knowledge to respond to assessment questions, her ability to reflect and identify weaknesses in her own knowledge structures was captured through the metacognitive tier.

Personal influences

When explaining confidence ranking selections, students’ self-reflections often included a discussion of personal and/or affective factors. This category of factors identified instances when students tied their own abilities, experiences, or attributes to their confidence in a content item answer.
Past experiences. When providing reasoning for their confidence judgments, students often cited their past experiences to support their chosen rankings. These memories recollected by students were offered as evidence to support their confidence judgments for specific items or content areas. Repeat exposures in both formal classroom and informal study settings were experiences which impacted confidence. For example, Sharon explained how frequent encounters with a concept across both general chemistry courses, as well as the experience of explaining the concept to a friend, improved her confidence ranking.

I'm gonna put well for this one, just because I feel like I'm remembering these things correctly. We've talked about this topic in, not only in Gen Chem 1, but we also talked about it in the second one. So, I feel like it's been very reinforced and that's why I put well instead of average. I would say out of 100% I feel like 75%. During the school year, my friend was like ‘Oh, I don't get this’ so I would frequently have to explain this topic to her.

(Sharon)

Sharon framed repeated exposures to the concept in a variety of settings, rather than strict application of content knowledge, as the experiences which influenced her high confidence ranking. When positive past experiences were related to confidence rankings, they were frequently applied in these topic-specific contexts related to repetition by instructors or peer-mediated interactions focused on one content area. These repeated cues regarding a specific content area can be considered environmental influences which impact metacognition when framed by SCT (Bandura, 1986).

Alternatively, negative past experiences more generally described students’ broader experiences with chemistry courses and interactions with their professors. One student who brought up negative interactions with instructors during the interview explained how these experiences led to a decline in her engagement with the online homework and with her professor's office hours:

I'd go with low confidence. I didn’t really get much help from the online homework for this, it was more about getting the points as opposed to actually doing it to ‘get’ it. I struggle with online homework in chemistry because I like to physically work stuff out and it was hard because [my professor] was the only professor teaching this course and I had a really full schedule last semester. I would go into [my professor's] office to talk to [them] about homework questions, but there would be 15 people in front of me so I couldn't get them answered. Then I’d find myself referring to online websites just to get the right answers as opposed to actually working through and understanding them.

(Celeste)

Similar to Celeste, Lily related her overall experience in chemistry to her confidence on a specific item, stating that she generally did not feel confident in the course.

I definitely don't have any confidence in the answer that I chose. In general, the entire time I was taking chemistry, it felt like every time we were on a new part, I was still trying to figure out the last part.

(Lily)

For these students, their ability to metacognitively engage with the confidence tier was overridden by powerful feelings evoked by their previous experiences, either specifically with a content area or more broadly in their learning environments. In these cases, strong affective associations impacted students’ perceptions of their content knowledge.

This feeling state response demonstrates the mental shortcut of an affect heuristic which students used to guide their confidence judgments. The use of affect heuristics leads to choices that are either consciously or unconsciously guided by affective feelings (positive or negative) tied to experiences with the content area (Finucane et al., 2000). When affect heuristics were observed, the feelings invoked were described by students as directly influencing confidence judgments. When students are choosing confidence rankings based on emotional reactions to the assessment item, it can be reasonably assumed that they are not effectively engaging in metacognitive reflection. Confidence judgments are then based less on the experience of answering the content question and more on shortcuts in response to emotions triggered by the assessment item.

Self-judgments. Another factor used by students to describe their confidence ranking processes was the generalized self-judgment, which reflected overarching, big picture evaluations of their own abilities rather than their mastery of the specific content knowledge within the assessment.

Generalized positive self-judgments reflected a strong identity as a successful STEM student. These judgments allowed students to overcome doubt or even a lack of content knowledge by buoying confidence rankings. When interviewed, Gus consistently described his confidence as high even when he acknowledged struggling or needing extended time to select an answer. At the end of the interview, the research team prompted Gus to reflect on his overall interaction with the confidence tier and his consistently high rankings:

I believe in myself, so if I take an ‘L’, I take an ‘L’. If I get a ‘W’, I get a ‘W’. I’m still going to be confident in myself when approaching the answers, so I will usually give myself ‘well’ and I'll always be ranking myself high. Whenever it comes to each specific question… it'll probably be a range, but overall, [my ranking would be] well. Whether I get questions right or wrong… if I feel like I'm at least on the right track, and if I need to study a little bit harder for something then I guess I will. Overall, I'd say it would always be well.

(Gus)

Gus demonstrated a strong, underlying positive judgment of his abilities in chemistry which translated into exclusively high and average confidence rankings despite acknowledging that he may have been answering some questions incorrectly. One possible explanation for these generally positive rankings would be to consider Gus's confidence judgment process as an example of the affect heuristic, as his positive feelings towards his overall ability to succeed on a general chemistry assessment dictated his decision to rank himself as highly confident.

Not all self-judgments students described were positive, however. Students who exhibited negative judgments did so in reference to their overall ability in chemistry or in general assessment settings. These static negative evaluations were pervasive during the interview regardless of the concept being assessed. Lily described her confidence judgment process as a response to her assessment anxiety:

I feel like the majority of time I start to go into panic mode and it's like, ‘Okay I have to answer, I have to pick something, it has to be right.’ Because I do that, I definitely answer a lot of things incorrectly because I don't let myself have time to think it through and work it out. I always answer questions incorrectly and the way that I answer the questions I could probably be doing something differently.

(Lily)

Lily's descriptions of her confidence judgments used particularly severe language such as “panic mode”, “it has to be right”, and “I always answer questions incorrectly”. Based on her thought processes, extreme language during the interview, and sweeping generalizations, Lily's ability to metacognitively engage with the confidence tier was consistently overridden by critical self-judgment and static evaluations of her understanding. The reasons she cited when ranking her confidence were not directly tied to her assessment of her performance on the assessment items but rather were a product of underlying negative self-judgments.

These examples of high and low baseline confidence due to self-judgments could reflect students’ use of self-efficacy judgments, rather than confidence in their understanding, as the basis on which to rank themselves. Self-efficacy beliefs are an example of personal influences which interplay with environmental and behavioural influences to affect students’ learning and metacognition (Bandura, 1986). If students identify themselves as particularly efficacious or inefficacious, these feelings of self-efficacy may be translated to their confidence rankings as a fixed overall confidence which lacks sensitivity to each individual response provided. If the target purpose of the confidence tier is to promote metacognition, students with extreme positive or negative self-efficacy beliefs who interpret the prompt as a general reflection on their ability to answer chemistry assessment items may not be benefiting.

Comparing confidence across items. In some interviews, students compared their confidence on a specific item to their confidence on previous items in the assessment. As students continued to engage in metacognitive practice, they described how the recent experience of engaging with the confidence tier during the assessment resulted in the development of a personal set of criteria for their ranking, which was constructed and reconstructed as the interview went on. Roy remarked on the adjustment of his ranking criteria here:

I would say I was able to do this one well. And earlier… I'd say it'd be well instead of average. I'd say well, not very well, because immediately I was like ‘Oh, I know this concept. I know what neutrons are, I know what protons are’ so I was able to take this off. And I know that electrons and protons are not what determines an isotope. But I was not able to confidently remove a second option.

(Roy)

As they were exposed to metacognitive practice over the course of the interview, students exhibited the ability to align their response processes within a set of personal criteria and described an understanding of their abilities with more detail and rigour. This has been demonstrated on a larger temporal scale in a longitudinal study which showed that students with more experience in chemistry learning environments exhibited better calibration than those with less experience (Atkinson and Bretz, 2021).

Peer comparison. When students rated their confidence, they referenced not only their own performance but also their perception of their classmates’ performance. For example, Sharon frequently remarked on how her rankings compared to how she thought her peers would perform. In some instances, she would rank herself based on her perception that her classmates understood a concept better than she did:

I want to put average because I feel like I said some things that were true but I'm gonna put poorly because I probably got it wrong to be honest. I know some people I was in class with definitely know the answer to this because it was on our tests, and it was something we had to understand.

(Sharon)

At another point in the interview, Sharon shifted and described her confidence as resulting from a stronger conceptual understanding relative to her peers:

I feel like other people probably don't remember this, but I'd like to say that I remembered the big topics… that's why I put average.

(Sharon)

Whether she felt she had less skill or more robust knowledge than her peers, Sharon described a comparison between her own understanding and that of her classmates. If students interpret the confidence tier primarily as a tool to engage in peer comparison, it may limit their ability to reflect on their own understanding, ultimately diminishing both the intended effect of engaging and improving students’ metacognition and the types of conclusions an instructor could make about students’ ideas about their conceptual understanding.

Time dependence

The construct of processing fluency describes the subjective experience of ease or difficulty in accomplishing a cognitive task and is often expressed in terms of the time required to retrieve knowledge (Finn and Tauber, 2015). Time required for retrieval of knowledge has been established in cognitive psychology literature as a factor which impacts Feelings of Rightness (FOR), a metacognitive experience which can signal to a learner when further analysis is needed (Thompson et al., 2011). Low FOR has been associated with longer rethinking times and an increased probability of answer change (Thompson et al., 2011). In the present study, the metacognitive judgment employed was a confidence judgment rather than FOR, as FOR is associated with claims surrounding quick-recall responses in studies utilizing Dual-Process Theory (Ackerman and Thompson, 2017). Despite content-specific differences in the measured construct, similar trends were still observed, as many students based their confidence judgments on the amount of time required to answer.
Retrieval ease. In some cases, students described a short, self-observed response time as a signal that increased their confidence, citing swift retrieval of knowledge as having a positive effect on their overall confidence. Expectedly, this retrieval ease was in some instances an indicator of strong content knowledge and overlapped with the Correctly Applied Conceptions code to reflect that relationship. When Gretchen was asked to identify what was inside the bubbles in a pot of boiling water, she chose the correct answer immediately and described her confidence here:

I’d pick very well. I didn't have a lot of trouble with this question because it was a concept I was familiar with. And [the ranking] is based on how much I thought about it. Which was not very much. When I was doing the test it was just like, ‘that's water’. It was pretty straightforward.

(Gretchen)

In instances where quickness to respond represented the use of a processing fluency heuristic to choose an option rather than correct conceptual knowledge, students also identified speed as an indicator of high confidence. In this example, Cady was asked to identify a bonding model which represented a water molecule and chose a distractor that did not account for hybridization. Her confidence was based on quick identification, and no rethinking was observed.

I was the most confident with that question so far, I would say I answered it very well. When I was reading the question, I was already able to visualize how I have learned and remembered a water molecule to look like. So, matching that up was pretty straightforward and that's why I feel that way.

(Cady)

Students who described quick thought processes resulting in rapid response times discussed how this retrieval ease improved confidence in their response. In addition, these students were not observed to go back and revisit other options to confirm their initial choice. When the first idea that came to mind was provided as an option, they often viewed the rapid response time as confirmation of a correct understanding.

Retrieval difficulty. On the other side of time dependence, instances were observed where students directly related longer processing times to lower confidence rankings. Here, Gretchen cited the length of time needed to choose an answer as her reason for selecting a lower confidence ranking.

I would say average because it took me a while and I had to think about more. I guess it was more than just my general knowledge of chemistry. I just had to think a little bit more so there's more room to second guess myself.

(Gretchen)

Gretchen went so far as to identify that when she took longer, she engaged in rethinking and even began to doubt her response, resulting in an average confidence ranking. This experience echoes previous literature using FOR judgments, which relates extended rethinking times to an increased probability of answer change (Thompson et al., 2011). For both retrieval difficulty and retrieval ease, time was an unreliable signal for students, as demonstrated by the varying degrees of success in their confidence rankings.

Test-taking strategies

The appropriate use of test-taking strategies can help students translate their knowledge from the classroom (McLellan and Craig, 1989) and has been shown to influence students’ performance on assessments (Dreisbach and Keogh, 1982). During interviews, students frequently applied strategic methods both explicitly and implicitly to answer questions and subsequently commented on use of these strategies when describing their confidence.
Identifying key words. Use of an associative processing heuristic leads students to choose answers which are based on surface features resembling their past observations or experiences (Talanquer, 2014). This past knowledge may be retrieved and used based on irrelevant or superficial similarities to current conditions. This was observed when students homed in on one familiar concept or phrase in an assessment item and used that easily activated information to select a response. Students were observed to place significant weight on this type of heuristic when ranking their confidence as it replicated feelings of retrieval ease and content mastery. Confidence was often ranked higher when surface features in an item prompted students to rely on pieces of easily activated knowledge.

Students cited this phenomenon as a test-taking strategy, describing a process of identifying key words wherein they would quickly scan the question and response options for easily activated concepts and disregard other relevant information provided. Regina stated that she applied this strategy, which led to an average confidence ranking despite the fact that she was not familiar with the concept of dispersion forces. She also discussed her past experiences with applying test-taking strategies:

I’m going to go with average since I didn't know the content super well to be able to answer it. But I used what I did know to try to eliminate some. I grew up doing academic team and taking standardized tests, so I’m used to using the strategies. Like, for this one, since I didn't really know the content, I picked apart the key words that were different in each of the answers, and then answered it based on the content. The first step in answering this question was just a standardized test-taking strategy rather than a chemistry strategy. I think that really helps you in taking questions like this and answering them.

(Regina)

Regina frequently applied strategies when content knowledge alone was not enough to choose a response, and her use of strategy directly informed her confidence ranking. Regina, like many other students across interviews, relied on both past experiences and test-taking strategies, demonstrating the complex overlap and intersection of factors.

Elimination. The ability to eliminate options using either content knowledge or key words was often a factor that students described when ranking their confidence. A common thread across interviews was students’ discussion of the number of answer options they were able to confidently discard in the process of answering the question. Students even described working criteria for how they ranked their confidence based exclusively on the number of eliminated options. For example, Haley explained that the difference between high and average confidence rankings was based on being able to definitively discard two versus only one of the answer options.

It really made me think a lot, evaluating myself and keeping it as the same scale when I’m comparing yourself between questions. Like if I was struggling between two choices, maybe I could put well. And maybe if I knew the right answer immediately, I would put very well. If I could only cross off one choice, maybe I could put average, but I’d have to make sure that I do that on every other question too. Having the same scale when it comes to all the other questions is what I was trying to do. Just making sure I was comparing them the same way, so that way I could look back at it and know, ‘oh yeah, I didn't know this one because I didn't know what the choices meant’.

(Haley)

Haley identified that she had developed criteria for her confidence judgments and carefully based them on how many response options she was able to eliminate. She explained why the use of defined criteria was useful for her:

I just feel like it'd be easier for me to compare them together when it comes to studying. Like if I had an exam coming up, I would know that isotopes are something I really need to review because compared to the other questions, I struggled with them the most. So, having them be similar to each other when it comes to scaling would be helpful so that you can compare them and look back at it to decide how much time you really need to spend studying each topic.

(Haley)

Haley was able to clearly outline how she utilized her content knowledge within her application of test-taking strategies, ultimately to target her study efforts. This criteria development indicates significant metacognitive skill and awareness and may be a useful framing when students are learning about metacognition. However, the successful application of this strategy seems more indicative of the stability of the content knowledge used to eliminate response options than of robust metacognitive skill. That is not to say that students who utilize elimination strategies are not engaging in metacognition; the above quote from Haley exemplifies a circumstance where chemistry content knowledge, the use of a test-taking strategy, acknowledgement of processing fluency, and metacognitive reflection are all present within the confidence judgment. Without the supporting conceptual understanding, however, elimination strategies can be an unreliable basis for confidence rankings.

Guessing. Students who identified a gap in their content knowledge frequently used guessing as a strategy to answer assessment items. When guessing was cited as a factor which impacted a student’s confidence judgment, it was nearly always a reason for a low confidence ranking.

The Hasan hypothesis posits that students who answer correctly but rank their confidence as low may be identified by instructors and researchers as possible “guessers” (Hasan et al., 1999). While this was the case in some instances, students’ low confidence rankings for correct responses were generally more nuanced, often based on affective factors or stemming from a self-identified gap in content knowledge rather than true guessing. Generalizing all responses which fall into the low confidence and correct category as guesses neglects several dimensions of student understanding and metacognition.

Interestingly, students whose ability to correctly guess in the absence of chemistry knowledge has been reinforced through past experiences may more closely fall under the “metaignorant” calibration category. Regina cited guessing as a factor which actually improved her confidence based on her successful history of guessing on assessments.

It was my gut guess. I've always been pretty good at guessing on standardized tests, so I eliminated those two pretty quickly. And then out of the two left, I was just like, I'll pick that one. But doesn't mean it's right, but that's just what I was thinking… my first guess.

(Regina)

Regina explained that her previous experiences in successfully guessing increased her confidence, while other students aligned with the Hasan hypothesis, highlighting variation from student to student. Clustering groups based on perceived metacognitive calibration in two-tiered assessments may therefore have the unintended consequence of mislabelling students like Regina due to the complex interweaving of factors which impact their confidence rankings.

Engaging with the confidence tier

At the end of interviews, students were asked to reflect on how they felt about ranking their confidence through the two-tier assessment format. This allowed students to provide descriptions of their perception of the confidence tier. Gretchen commented on how she developed a working criterion to rank her confidence and restructured it based on new information over the course of the interview:

So now since I've said ‘well’, I would say that not having reasons for the other answers to be wrong would be the distinction between ‘very well’ and ‘well’. So ‘very well’ would be when I could argue why each wrong choice is wrong and know I’m right because I have that information. But ‘well’ would be when I know which one is right, but I don't know why some of the other choices aren't right. I feel like whenever I started these ranking things I just picked ‘very well’ but then as it went on it became like ‘compared to the last question, it's not ‘very well’, it's just ‘well’’. It just gets more defined as I go along. Like in the beginning, like the first question, if it was hard for me then I'd be less inclined to say, ‘average’ and more inclined to say, ‘very poorly’, because I don't know what's coming. I guess I'm comparing it to the other questions.

(Gretchen)

This indicates that significant cognitive effort may be expended to structure and restructure a personal rubric for confidence judgments across an entire assessment.

Students responded to the confidence tier positively, indicating that the tool was helpful for them and could be applied in their future studies. Cady discussed how she would engage in metacognitive reflection and self-judgment to guide her study efforts.

I think this is great… I'm actually going to take this with me in my studies this coming year. I feel like it's really important to just see how you feel about certain questions because it allows you to have a starting point for things you need to work on and things that you can skip over. I know when studying sometimes I go over things that I'm already good at just for a confidence booster, and honestly that's kind of a waste of time. So, I like this method a lot actually.

(Cady)

When students verbalized the utility of the tool within the interview, it was well-conceptualized and described as applicable in contexts outside of chemistry and outside of summative assessments. Cady was able to describe in detail how she would use metacognitive confidence judgments in the future while studying to direct her focus and attention and ultimately improve her performance on assessment tasks. This exemplifies the interaction observed between environmental and personal influences (i.e., feedback and self-reflection) and behavioural factors (i.e., targeted study tasks) guided by the metacognitive tier when considered within the theoretical framework of this study.

Factors and confidence-accuracy calibration (RQ2)

During analysis, paired content and confidence responses were characterized as calibrated, i.e., confidence aligned with correctness, or miscalibrated, i.e., confidence misaligned with correctness. Calibration is the metacognitive process in which students measure how effectively a judged rating of performance corresponds to their actual performance (Keren, 1991). With this definition in mind, the aim was to explore the assumption that calibrated responses provide evidence of metacognitive skill in chemistry learners, and that gaps in metacognitive skill lead to miscalibrated confidence in students’ accuracy.

Possible response patterns are outlined in Fig. 4. Student responses were categorized as correct or incorrect, and the calibration of their confidence was based on their selection of a high or low confidence ranking. Response patterns outlined in the dashed box represent miscalibrated confidence judgments.


Fig. 4 Classification of calibration categories based on students’ response patterns. The dashed box represents miscalibrated response patterns for both high and low confidence.
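
To make the classification scheme described above and in Fig. 4 concrete, the short Python sketch below shows one way paired content and confidence responses could be sorted into these calibration categories. This is an illustrative sketch only, not part of the study's analysis; the function name, the confidence-scale labels, and the treatment of the average option are assumptions drawn from the descriptions in this section.

# Illustrative sketch (hypothetical; not the study's analysis code).
# Maps a dichotomous content response and a confidence ranking to a
# calibration category, following the descriptions accompanying Fig. 4.
def classify_calibration(is_correct: bool, confidence: str) -> str:
    high = confidence in ("very well", "well")  # assumed high-confidence labels
    if confidence == "average":
        # Average cannot be aligned with a dichotomous (correct/incorrect) item,
        # so it is treated as miscalibrated regardless of correctness.
        return "miscalibrated (average)"
    if is_correct and high:
        return "calibrated: high confidence-correct (known known)"
    if not is_correct and not high:
        return "calibrated: low confidence-incorrect (known unknown)"
    if not is_correct and high:
        return "miscalibrated: high confidence-incorrect (unknown unknown)"
    return "miscalibrated: low confidence-correct (unknown known)"

# Example with hypothetical paired responses (correctness, confidence ranking):
for correct, conf in [(True, "very well"), (False, "well"), (True, "poorly"), (False, "average")]:
    print(correct, conf, "->", classify_calibration(correct, conf))

Treating the average option as its own miscalibrated bin in this sketch mirrors the discussion of average confidence responses later in this section.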

Calibrated confidence judgments

Students who demonstrated calibrated high confidence exhibited a ‘known known’ and those with calibrated low confidence exhibited a ‘known unknown’ (Fig. 4). For students who exhibited calibration, this was indicative of metacognitive skill: students appropriately judged their confidence. The students who fell into this category either answered the item correctly and felt confident in their answer or recognized that they did not know the answer and subsequently ranked themselves with low confidence.
High confidence-correct response. Students who answer chemistry content questions correctly and rank their confidence as high are indicating to themselves, as well as to the instructor and/or researcher, that they are engaging in calibrated metacognitive monitoring. When High Confidence-Correct Response (HC) processes were analysed, students who fell into this high-performing, highly confident group often described their content mastery as a direct influence on their confidence. Other factors like retrieval ease/processing fluency and the use of test-taking strategies were also frequently cited by these students as support for their high confidence rankings. For instance, students cited the number of options eliminated as a way to gauge their confidence. If they were able to employ their content knowledge to eliminate all but one of the response options, students felt very confident in their answer. Not only were students showing that they conceptually knew the correct answer, but their confidence was boosted by being able to identify the remaining options as incorrect. While the use of test-taking strategies was pervasive across all performance levels, students who combined them with stable, correct conceptions were more successful on assessment items and provided the most calibrated confidence judgments. This indicates that calibration is likely dependent on the strength of content knowledge and that the use of strategies does not necessarily overshadow metacognition. Basing confidence rankings on strategies which are utilized in tandem with content mastery still allows for metacognition to be captured by the confidence tier.

Two main observations regarding calibrated, highly confident students can be made based on HC interview responses. First, students who rank their confidence as high are often ranking their confidence based on how well they performed within the constraints of an assessment rather than how well they conceptually understood and applied their chemistry knowledge. This is likely a result of the language used in the stem of the confidence tier, which asks students to rank how well they felt they answered the question. The resulting measurement therefore captures how well students feel they navigate exams rather than their confidence in their chemistry knowledge, lessening the benefit of engaging in metacognition within a chemistry assessment context. Second, the prevalence of students basing their confidence on test-taking strategies independent of correct conceptual understanding may be considered a limitation specific to multiple-choice assessments, which are uniquely suited to strategic thinking. For the confidence tier to function as intended, and for it to accurately measure the construct it is designed to, students may require priming or more targeted language to outline the purpose of engaging in metacognitive practices and direct their self-reflection towards content mastery rather than assessment competence.

Low confidence-incorrect response. On the opposite end of the performance spectrum, students who answer incorrectly and rank their confidence as low are also signalling that they are engaged in calibrated metacognitive monitoring. Students who cite their lack of content knowledge or understanding of a chemistry concept to rank their incorrect answer with low confidence are providing evidence of metacognitive skill which can be leveraged to direct future learning behaviours. The Low Confidence-Incorrect Response (LI) processes showed that students were exhibiting calibrated metacognitive skill, but that confidence judgments were deeply enmeshed with affective factors like negative self-judgments and peer comparison, as well as with limited processing fluency. It is possible that these affective and time-related factors may have originated in experiences related to content knowledge from their learning process, but students continue to use these same rigid frames without considering their current content knowledge. When LI students brought up test-taking strategies, they often described their inability to eliminate a majority of the response options, prompting them to doubt their understanding of the concept. For these metacognitively calibrated students, this expression of doubt was appropriate because they selected an incorrect answer and lacked the appropriate content knowledge.

Successfully engaging in metacognitive calibration generally serves students well as they engage in both formative and summative assessments. For students whose responses fall under this LI category, however, their rankings are influenced by the emotions evoked by the experience of being assessed, both by the assessment and by themselves. These affective domains are not being directly targeted by the confidence tier, but they are being captured as students revert to their mindset from the learning process. While the data gathered accurately reflect the low confidence of students with weaker chemistry knowledge, the negative focus and self-talk occurring in response to this tool can be, at best, distracting and, at worst, damaging to the target population. This may have unintended consequences which could impede these students’ ability to self-assess in the future.

Miscalibrated confidence judgments

Students with miscalibrated high confidence exhibited an ‘unknown unknown’ and those with miscalibrated low confidence expressed an uncertainty about their conceptual understanding, or what could be considered an ‘unknown known’. In Fig. 4, responses falling into the miscalibrated categories are represented in the dashed box. It is particularly important to address metacognitive miscalibration because confidence rankings which did not align with the correctness of students’ content knowledge indicate that students are not accurately evaluating what they know.
High confidence-incorrect response. Miscalibrated assessment of an incorrect response with high confidence has been established in CER literature to be aligned with Dunning and Kruger's metaignorance phenomenon, where chemistry students who have limited content knowledge demonstrate less accurate metacognitive judgments (Bell and Volckmann, 2011; Brandriet and Bretz, 2014; Pazicni and Bauer, 2014; Connor et al., 2021). In assessment development studies where cluster analysis is performed to identify groups of similarly performing students, emergent groups consistent with the Dunning–Kruger phenomenon are presented for instructors as targets for interventions.

In interviews, students who selected high confidence when incorrect were often observed to hold deep-seated alternate conceptions which prompted a high level of confidence. Students with alternate conceptions cited processing fluency because the information they drew on was readily available or easily activated, and these signals served as indicators that they knew the concept well. However, these parameters resulted in a false positive which signalled conceptual understanding to students. Possessing any alternate conception indicates that a student is not a complete novice, as they have some level of understanding of chemical concepts. At this stage in their education, students have begun to master some topics, making them feel confident in their content knowledge. Because students are enrolled in an introductory chemistry course and are relatively new to chemistry content at large, they may be at the stage where they have enough working knowledge to engage in the material but are still misunderstanding key concepts. False-positive signals of processing fluency corresponding to an alternate conception consequently made students feel overconfident in their knowledge. In some instances, quick activation of a concept can appropriately signal correct conceptual understanding; however, it can be particularly deceptive when the student holds a deep-seated alternate conception. As such, these signals should be used with caution, and their unreliability is an important discussion point when asking students to engage in metacognitive tasks.

Students who are miscalibrated through overconfidence do not recognize that they have crossed the boundary between the concepts they correctly understand and those they misunderstand. Knowing when this boundary is crossed is the target metric of metacognition, and improving awareness of such crossings is central to engaging in metacognitive tasks. The challenge remains that even experts are sometimes unaware of their unknown unknowns, that is, unaware that the boundary has been crossed. Improving this metacognitive skill therefore remains a target for chemistry learners.

Low confidence-correct response. While some students exhibited overconfident miscalibration, others displayed underconfident miscalibration. This category has been utilized to characterize guessing on assessment items (Hasan et al., 1999), and in some instances that is the case. However, interviews also elicited instances when students had low confidence based on various external factors. In many cases, the factors lowering confidence were affective or experiential in nature rather than strict evaluations of their understanding of the concepts on the assessment item, which limits the quantitative utility of this tier.

The students in this category often possessed correct conceptions regarding the chemistry content but were not confident in that knowledge. In some cases, students described past experiences of initially struggling with the material but did not consider whether they had since developed a better understanding of that concept. Their past experience of struggle overrode any present reflection on conceptual understanding. In other cases, students were doubtful of their answer because of their inability to eliminate all but one response option. Due to the nature of multiple-choice assessments and the prevalence of test-taking strategies, many students cited being stuck deciding between two options, or being unable to confidently eliminate the remaining options, as metrics which cast doubt on their confidence. Many of these students were able to correctly explain a concept out loud to interviewers but then felt doubtful as they engaged in the multiple-choice question. In these instances, the prevalence of test-taking strategies impeded students from exclusively considering their content knowledge. For students in the underconfident miscalibration category, it may be helpful to prompt them to generate an answer to the stem prior to reading the response options and then select the option that best matches their initial thought.

Average confidence responses. Students who selected average confidence were considered to be miscalibrated, as content questions were dichotomous in nature and average could not be aligned with either a correct or an incorrect response. Selection of this response option led to difficulties capturing calibration of metacognition. Regardless of understanding, the average option presents students with the opportunity to select a seemingly neutral response rather than rank themselves high or low.

In many cases, selecting average indicated that a secondary dimension may be encroaching on a student's ability to engage in true metacognition. For example, Ted held substantial content knowledge but expressed doubt in many of his responses, and ultimately ranked half of his responses as average. In another case, Gus possessed high underlying self-efficacy which may have overinflated confidence in his application of content knowledge on specific questions. In these scenarios, overriding self-efficacy perceptions can interfere with effective metacognition.

In other instances, a tendency to compare one's performance to peers’ performance may lead students to rank themselves as average in an effort to sort themselves as part of the group. Again, this indicates less engagement with metacognition and more engagement with ancillary thought processes separate from their accuracy on the assessment.

Factors across calibration categories. For each calibration category, emergent themes are summarized in Fig. 5. Throughout the results, quotes were introduced that were representative of participants’ views. Fig. 5 serves as a visualization of all the emergent themes across interviews. For instance, test-taking strategies such as Elimination were used in every calibration category with varying degrees of success. When combined with stable chemistry conceptions, test-taking strategies were often helpful; however, they were also observed as false-positive signals in students who incorrectly applied conceptions to answer assessment items. Another factor, self-judgment, was present across a majority of the categories, where students’ underlying self-efficacy shaped their confidence judgment independent of content knowledge. Broadly across categories, students were observed to draw on a wide variety of both content- and affective-related factors to select their confidence ranking.
Fig. 5 Prevalent factor themes present in calibration categories.

Limitations

The findings from this study are constrained to the context in which data collection was performed, namely multiple-choice assessment items in an interview setting. The metacognitive processes and test-taking strategies referenced were specific to multiple-choice assessment formats. While this is a common assessment format that students will likely see in future courses, claims outside of this context cannot be made. Additionally, the multiple-choice questions presented in interviews were items from the Water Instrument assessment (Balabanoff et al., 2022) that students had previously interacted with. Interviews took place shortly after the course had ended. The extent to which temporal or repeated interactions with assessment items may have influenced confidence rankings is unclear; however, these are constraints characteristic of response process validity interviews (Miles et al., 2014) and may be further complicated in the present study by the influence of environmental cues on metacognition and learning (Bandura, 1986). As such, how repeat exposures influence confidence ratings may be an area for future investigation.

During interviews, accuracy-confidence calibration was elicited as a point-in-time measurement, meaning that claims cannot be made about the stability of confidence in content knowledge over time based on these findings. While interviews were collected from two different higher-education institutions, demographics and associated trends were not accounted for within the confines of this study.

Future work includes investigating the intersection of demographic, environmental, and/or identity factors and their role regarding student confidence and calibration, as well as exploring the possible influence of social desirability on students engaging in metacognition while speaking with an interviewer.

Implications and future research

The confidence tier has been utilized in assessment development as a way to measure and promote metacognition as well as to generate claims about chemistry students’ conceptual understanding. Students are basing their confidence rankings on a range of factors in addition to content knowledge, which may skew their rankings and cloud their metacognitive judgments.

If the goal is to measure confidence to make more comprehensive claims about students’ content knowledge, there is significant evidence from response process validity interviews that external factors are impeding the assessment from capturing metacognitive skill and content-centred confidence. There are several instances where students are engaging in calibrated metacognition. However, the processes that students outline in their responses show a host of influences beyond content mastery when a confidence ranking is chosen. The issue is not that the confidence tier does not capture confidence judgments, but rather that it captures additional information.

If the goal of including a confidence tier in a two-tier assessment is to promote metacognition, assessment designers in both research and instructional settings should recognize that students are considering factors beyond strict metacognition. This tier may serve students who already possess strong metacognitive skill by engaging them in the practice, but may prompt those without significant experience with metacognitive monitoring to spend valuable cognitive energy thinking about external factors.

For students who rank themselves based on their perceived understanding relative to their peers, the confidence tier is providing incomplete information about individual students’ metacognitive abilities to those analysing its output. Ranking oneself relative to the class is likely a symptom of the U.S. higher education system where curving grades is a common practice. This means that effort must be put into priming students on the utility of the confidence tier to better contextualize confidence as relative to the material rather than relative to their peers. Explicitly priming students by explaining the target construct of the confidence tier before they take a two-tiered assessment may provide them with more clarity and purpose to engage in metacognitive practice.

Restructuring both the testing stem and response options to clarify the purpose of the confidence tier may improve both promotion and measurement of students’ metacognition. One technique that could assist in clarifying the metacognitive target of the confidence tier for students would be using more targeted stems in prompts. Many assessments using confidence tiers utilize language such as “how well do you feel you answered the previous question?” and may benefit from rephrasing. Specifically, directing students’ attention to reflect on how effectively they were able to apply their content knowledge would assist in narrowing their focus. Additional clarity can also come with restructuring response options. A more thorough understanding of students’ calibration may come if an average option is not provided in a confidence tier, as for many students this option is chosen as a null response (Chyung et al., 2017) based on peer comparison or an underlying baseline self-efficacy feeling. Constraining students to select either high or low confidence has the potential to promote targeted metacognitive engagement as well as provide richer data for analysts. In some studies, the removal of a midpoint resulted in respondents selecting either a more positive or a more negative option (Worcester and Burns, 1975; Garland, 1991), suggesting context-specific engagement with Likert-scale items. As such, the removal of the average option from confidence items warrants further investigation in the context of chemistry assessments.

For continued use of confidence tiers, these findings highlight the need for assessment designers to consider what the intended construct of interest is and to design item stems to specifically capture that construct. As previously posited, miscalibration of high confidence can be an indicator of deep-seated alternate conceptions. Miscalibrated low confidence, which has previously been cited as a hallmark of guessing, more accurately provides insight into environmental factors such as students’ preconceived notions about their own chemistry abilities. These inferences may become more reliable as the clarity of the testing stems and the structure of response options are improved, and with the inclusion of explicit priming.

The interpretation of this assessment output by instructors to focus instructional changes or target individual miscalibrated students may be most useful if paired with explicit instruction regarding how results will be used. A recent study found that students’ confidence rankings in two-tier question formats were highly calibrated across all performance levels when surveyed using clickers during class periods (Bunce et al., 2023). This presents a compelling argument for the use of two-tier assessment items in low-stakes formative assessments to allow instructors to provide immediate feedback on miscalibrated concept areas. This may be especially beneficial because timely feedback may lessen the interference from external factors seen in the current study, which may be due in part to interviews being collected after the assessment. Ongoing interaction with confidence rankings in formative assessment settings during a course gives students practice engaging in metacognition, providing higher quality feedback for instructors and stronger metacognitive skill development for students. When asked to engage with the confidence tier in interviews, students responded positively, and some went as far as to indicate that they would engage in their own confidence reflections independently in future assessments. This benefit would be further supported with follow-up instruction and discussion of the utility and importance of metacognitive monitoring.

It is imperative in promoting metacognition to move away from a deficit mindset when students are misinformed or uninformed. An alternative framing for miscalibration of confidence rankings is as an opportunity to improve metacognitive skill and better prepare students for future educational endeavours, rather than as an opportunity to identify students with misaligned perceived accuracy for the sake of informing them that they are misaligned. Metacognition is a powerful and important tool in a chemistry student's toolbox, so assessment tools which target this skill offer more holistic student feedback. It is up to those who use these tools in assessment development, both in research and instructional contexts, to decide what kind of metacognitive task is most beneficial for that particular context. If the confidence judgment is the best fit for the assessment through targeting accuracy in a specific cognitive domain, some adjustment may be warranted to better capture strict metacognition of content application. This may mean utilizing confidence rankings more frequently in formative assessment settings and/or improving and narrowing prompts and response options. Alternatively, well-designed self-efficacy prompts may provide the information desired by the assessment developer, which would call for its own specific set of criteria to be met. Ultimately, assisting students in developing an accurate understanding of the utility of the chosen metacognitive tier will allow it to be more reliably used to identify and combat alternate conceptions and improve chemistry instruction.

Conclusions

This study investigated students’ confidence in their content knowledge within the context of a two-tiered general chemistry assessment. While previous studies have quantitatively shown that students sometimes have miscalibrated confidence, often attributed to metaignorance, this study observed this miscalibration in general chemistry students through qualitative interview data, providing a more robust understanding of why it occurs. Students who fell into the miscalibrated group often used unreliable metrics such as processing fluency, which may mimic robust understanding of chemistry concepts. The students who were more successful in evaluating their confidence relied more heavily on their stable conceptual understanding. Many students cited previous experiences in their course or with peers as factors that negatively or positively impacted their confidence. These findings suggest that the confidence tier is indeed capturing students’ self-assessment; however, students’ confidence rankings are based on a large range of factors independent of content knowledge, which are ultimately not productive as students engage in this metacognitive task.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

We would like to thank the students who gave their time and thoughtful responses in interviews as well as their instructors who permitted us to recruit and collect data in their classes. Additional acknowledgement must be given to Alena Moon who offered guidance in the early stages of this work.

References

  1. Abell T. N. and Bretz S. L., (2019), Development of the Enthalpy and Entropy in Dissolution and Precipitation Inventory, J. Chem. Educ., 96(9), 1804–1812.
  2. Ackerman R. and Thompson V. A., (2017), Meta-Reasoning: Monitoring and Control of Thinking and Reasoning, Trends Cognit. Sci., 21(8), 607–617.
  3. Adey P., Shayer M. and Yates C., (1989), Thinking science: the curriculum materials of the Cognitive Acceleration through Science Education (CASE) project, London: Nelson Thornes.
  4. Allwood C. M., Jonsson A. C. and Granhag P. A., (2005), The effects of source and type of feedback on child witnesses’ metamemory accuracy, Appl. Cognitive Psychol., 19, 331–344.
  5. Arjoon J. A., Xu X. and Lewis J. E., (2013), Understanding the state of the art for measurement in chemistry education research: examining the psychometric evidence, J. Chem. Educ., 90(5), 536–545.
  6. Atkinson M. B. and Bretz S. L., (2021), Measuring Changes in Undergraduate Chemistry Students’ Reasoning with Reaction Coordinate Diagrams: A Longitudinal, Multi-institution Study, J. Chem. Educ., 98(4), 1064–1076.
  7. Atkinson M. B., Popova M., Croisant M., Reed D. J. and Bretz S. L., (2020), Development of the reaction coordinate diagram inventory: measuring student thinking and confidence, J. Chem. Educ., 97(7), 1841–1851.
  8. Avargil S., (2019), Learning chemistry: self-efficacy, chemical understanding, and graphing skills, J. Sci. Educ. Technol., 28(4), 285–298.
  9. Balabanoff M., Al Fulaiti H., DeKorver B., Mack M. and Moon A., (2022), Development of the Water Instrument: a comprehensive measure of students’ knowledge of fundamental concepts in general chemistry, Chem. Educ. Res. Practice, 23(2), 348–360.
  10. Bandura A., (1977), Self-efficacy: toward a unifying theory of behavioral change, Psychol. Rev., 84(2), 191–215.
  11. Bandura A., (1986), Social foundations of thought and action: A social cognitive theory, Englewood Cliffs, NJ: Prentice-Hall.
  12. Bandura A., (1997), Self-Efficacy: The exercise of control, New York, NY: W. H. Freeman.
  13. Barbera J. and VandenPlas J. R., (2011), All assessment materials are not created equal: the myths about instrument development, validity, and reliability, in Investigating classroom myths through research on teaching and learning, American Chemical Society, pp. 177–193.
  14. Bell P. and Volckmann D., (2011), Knowledge surveys in general chemistry: confidence, overconfidence, and performance, J. Chem. Educ., 88(11), 1469–1476.
  15. Blank L. M., (2000), A metacognitive learning cycle: a better warranty for student understanding? Sci. Educ., 84(4), 486–506.
  16. Brandriet A. R. and Bretz S. L., (2014), The development of the redox concept inventory as a measure of students’ symbolic and particulate redox understandings and confidence, J. Chem. Educ., 91(8), 1132–1144.
  17. Bodner G. M., (1986), Constructivism: A theory of knowledge, J. Chem. Educ., 63(10), 873.
  18. Bodner G. M., (1991), I have found you an argument: The conceptual knowledge of beginning chemistry graduate students, J. Chem. Educ., 68(5), 385.
  19. Bunce D. M., Schroeder M. J., Luning Prak D. J., Teichert M. A., Dillner D. K., McDonnell L. R., Midgette D. P. and Komperda R., (2023), Impact of Clicker and Confidence Questions on the Metacognition and Performance of Students of Different Achievement Groups in General Chemistry, J. Chem. Educ., 100(5), 1751–1762.
  20. Burson K. A., Larrick R. P. and Klayman J., (2006), Skilled or unskilled, but still unaware of it: perceptions of difficulty drive miscalibration in relative comparisons, J. Personality Soc. Psychol., 90, 60–77.
  21. Casselman B. L. and Atwood C. H., (2017), Improving general chemistry course performance through online homework-based metacognitive training, J. Chem. Educ., 94(12), 1811–1821.
  22. Cervone D. and Peake P. K., (1986), Anchoring, efficacy, and action: the influence of judgmental heuristics on self-efficacy judgments and behavior, J. Personality Soc. Psychol., 50(3), 492.
  23. Chyung S. Y., Roberts K., Swanson I. and Hankinson A., (2017), Evidence-Based Survey Design: The Use of a Midpoint on the Likert Scale, Performance Improvement, 56(10), 15–23.
  24. Clinchot M., Ngai C., Huie R., Talanquer V., Lambertz J., Banks G., Weinrich M., Lewis R., Pelletier P. and Sevian H., (2017), Better Formative Assessment, The Sci. Teacher, 84(3), 69.
  25. Connor M. C., Glass B. H. and Shultz G. V., (2021), Development of the NMR Lexical Representational Competence (NMR-LRC) Instrument As a Formative Assessment of Lexical Ability in 1H NMR Spectroscopy, J. Chem. Educ., 98(9), 2786–2798.
  26. Cook E., Kennedy E. and McGuire S. Y., (2013), Effect of teaching metacognitive learning strategies on performance in general chemistry courses, J. Chem. Educ., 90(8), 961–967.
  27. Dalgety J. and Coll R. K., (2006), Exploring First-Year Science Students’ Chemistry Self-Efficacy, Int. J. Sci. Math. Educ., 4(1), 97–116.
  28. Davidson J. E., Deuser R. and Sternberg R. J., (1994), The role of metacognition in problem solving, in Metacognition: Knowing about Knowing, Cambridge, MA, US: The MIT Press, pp. 207–226.
  29. Davis, E. A., (1996), Metacognitive scaffolding to foster scientific explanations. Paper presented at the Annual Meeting of the American Educational Research Association, New York, 8–14 April.
  30. Dori Y. J., Avargil S., Kohen Z. and Saar L., (2018), Context-based learning and metacognitive prompts for enhancing scientific text comprehension, Int. J. Sci. Educ., 40(10), 1198–1220.
  31. Dreisbach M. and Keogh B. K., (1982), Testwiseness as a factor in readiness test performance of young Mexican-American children, J. Educ. Psychol., 74(2), 224.
  32. Farh J. L. and Dobbins G. H., (1989), Effects of comparative performance information on the accuracy of self-ratings and agreement between self-and supervisor ratings, J. Appl. Psychol., 74(4), 606.
  33. Ferrell B. and Barbera J., (2015), Analysis of students’ self-efficacy, interest, and effort beliefs in general chemistry, Chem. Educ. Res. Pract., 16(2), 318–337.
  34. Finn B. and Tauber S. K., (2015), When confidence is not a signal of knowing: How students’ experiences and beliefs about processing fluency can lead to miscalibrated confidence, Educ. Psychol. Rev., 27, 567–586.
  35. Finucane M. L., Alhakami A., Slovic P. and Johnson S. M., (2000), The affect heuristic in judgments of risks and benefits, J. Behav. Decis. Mak., 13(1), 1–17.
  36. Flavell J. H., (1979), Metacognition and cognitive monitoring: a new area of cognitive-developmental inquiry, Am. Psychol., 34, 906–911.
  37. Gabel D. L. and Bunce D. M., (1994), Research on Chemistry Problem Solving, in Handbook of Research in Science Teaching and Learning, New York, NY: Macmillan, pp. 301–326.
  38. Garland R., (1991), The Mid-Point on a Rating Scale: Is it Desirable? Marketing Bull., 2, 66–70.
  39. Georghiades P., (2000), Beyond conceptual change learning in science education: focusing on transfer, durability and metacognition, Educ. Res., 42(2), 119–139.
  40. Georghiades P., (2004), From the general to the situated: three decades of metacognition, Int. J. Sci. Educ., 26(3), 365–383.
  41. Gigerenzer G., (1991), How to make cognitive illusions disappear: beyond “heuristics and biases”, Eur. Rev. Soc. Psychol., 2(1), 83–115.
  42. Gigerenzer G. and Todd P. M., (1999), Fast and frugal heuristics: the adaptive toolbox, in Simple heuristics that make us smart, Oxford University Press, pp. 3–34.
  43. Gilovich T., Griffin D. and Kahneman D., (2002), Heuristics and biases: The psychology of intuitive judgment, Cambridge University Press.
  44. Glaser B. G., (1965), The constant comparative method of qualitative analysis, Social problems, 12(4), 436–445.
  45. Hacker D. J., Bol L. and Bahbahani K., (2008), Explaining calibration accuracy in classroom contexts: the effects of incentives, reflection, and explanatory style, Metacognition Learn., 3(2), 101–121.
  46. Han J., Kamber M. and Pei J., (2012), Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd edn, vol. 10, pp. 361–367.
  47. Hasan S., Bagayoko D. and Kelley E. L., (1999), Misconceptions and the Certainty of Response Index (CRI), Phys. Educ., 34(5), 294.
  48. Hawker M. J., Dysleski L. and Rickey D., (2016), Investigating General Chemistry Students’ Metacognitive Monitoring of Their Exam Performance by Measuring Postdiction Accuracies over Time, J. Chem. Educ., 93(5), 832–840.
  49. Heckler A. F. and Scaife T. M., (2015), Patterns of response times and response choices to science questions: the influence of relative processing time, Cognitive Sci., 39(3), 496–537.
  50. Heredia K. and Lewis J. E., (2012), A psychometric evaluation of the colorado learning attitudes about science survey for use in chemistry, J. Chem. Educ., 89(4), 436–441.
  51. Jacobs J. E. and Paris S. G., (1987), Children's metacognition about reading: issues in definition, measurement, and instruction, Educ. Psychol., 22(3–4), 255–278.
  52. Kan A. and Akbaş A., (2006), Affective factors that influence chemistry achievement (attitude and self efficacy) and the power of these factors to predict chemistry achievement-I, J. Turkish Sci. Educ., 3(1), 76–85.
  53. Keren G., (1991), Calibration and probability judgments: conceptual and methodological issues, Acta Psychol., 77, 217–273.
  54. Kruger J. and Dunning D., (1999), Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments, J. Pers. Soc. Psychol., 77 (6), 1121–1134.
  55. Lavi R., Shwartz G. and Dori Y. J., (2019), Metacognition in Chemistry Education: A Literature Review, Isr. J. Chem., 59(6–7), 583–597.
  56. Livingston J. A., (2003), Metacognition: An Overview, Buffalo, New York.
  57. Mahaffy P., (2004), The Future Shape of Chemistry, Chem. Educ. Res. Pract., 5(3), 229–245.
  58. McClary L. M. and Bretz S. L., (2012), Development and assessment of a diagnostic tool to identify organic chemistry students’ alternative conceptions related to acid strength, Int. J. Sci. Educ., 34(15), 2317–2341.
  59. McLellan J. and Craig C., (1989), Facing the Reality of Achievement Tests, Educ. Canada, 29(2), 36–40.
  60. Miles M. B., Huberman A. M. and Saldaña J., (2014), Qualitative data analysis: A methods sourcebook, 3rd edn, SAGE Publications, Inc.
  61. Nelson T. O., (1996), Gamma is a measure of the accuracy of predicting performance on one item relative to another item, not the absolute performance on an individual item: comments on Schraw (1995), Appl. Cognitive Psychol., 10, 257–260.
  62. NGSS Lead States, (2013), Next Generation Science Standards: For States, By States, Washington, DC: The National Academies Press.
  63. Nieswandt M., (2007), Student affect and conceptual understanding in learning chemistry, J. Res. Sci. Teach., 44(7), 908–937.
  64. Nietfeld J. L., Cao L. and Osborne J. W., (2005), Metacognitive monitoring accuracy and student performance in the postsecondary classroom, J. Exp. Educ., 74, 7–28.
  65. Nietfeld J. L., Cao L. and Osborne J. W., (2006), The effect of distributed monitoring exercises and feedback on performance, monitoring accuracy, and self-efficacy, Metacogn. Learn., 1, 159–179.
  66. Pallier G., Wilkinson R., Danthiir V., Kleitman S., Knezevic G., Stankov L. and Roberts R. D., (2002), The Role of Individual Differences in the Accuracy of Confidence Judgments, J. General Psychol., 129(3), 257–299.
  67. Pazicni S. and Bauer C. F., (2014), Characterizing illusions of competence in introductory chemistry students, Chem. Educ. Res. Pract., 15(1), 24–34.
  68. Peterson C. H., Peterson N. A. and Powell K. G., (2017), Cognitive interviewing for item development: validity evidence based on content and response processes, Meas. Eval. Couns. Dev., 50(4), 217–223.
  69. Rozin P., (2005), The meaning of “natural” process more important than content, Psychol. Sci., 16(8), 652–658.
  70. Schraw G., (2009), A conceptual analysis of five measures of metacognitive monitoring, Metacognition Learn., 4(1), 33–45.
  71. Schraw G. and Moshman D., (1995), Metacognitive theories, Educ. Psychol. Rev., 7, 351–371.
  72. Schunk D. H. and DiBenedetto M. K., (2020), Motivation and social cognitive theory, Contemp. Educ. Psychol., 60, 101832.
  73. Swanson H. L., (1990), Influence of metacognitive knowledge and aptitude on problem solving, J. Educ. Psychol., 82(2), 306.
  74. Talanquer V., (2014), Chemistry Education: Ten Heuristics To Tame, J. Chem. Educ., 91(8), 1091–1097.
  75. Teichert M. A. and Stacy A. M., (2002), Promoting understanding of chemical bonding and spontaneity through student explanation and integration of ideas, J. Res. Sci. Teach., 39(6), 464–496.
  76. Thiede K. W., Anderson M. C. M. and Therriault D., (2003), Accuracy of metacognitive monitoring affects learning of texts, J. Educ. Psychol., 95(1), 66.
  77. Thiede K. W., Wiley J. and Griffin T. D., (2011), Test expectancy affects metacomprehension accuracy, Br. J. Educ. Psychol., 81(2), 264–273.
  78. Thompson V. A., Prowse Turner J. A. and Pennycook G., (2011), Intuition, reason, and metacognition, Cogn. Psychol., 63(3), 107–140.
  79. Tourangeau R., (1984), Cognitive sciences and survey methods, Cogn. Aspects Survey Methodology: Build. Bridge Disciplines, 15, 73–100.
  80. Treagust D. F., (1986), Evaluating students' misconceptions by means of diagnostic multiple choice items, Res. Sci. Educ., 16(1), 199–207.
  81. Treagust D. F., (1988), The development and use of diagnostic instruments to evaluate students' misconceptions in science, Int. J. Sci. Educ., 10, 159–169.
  82. Treagust D. F., (1995), Diagnostic assessment of students' science knowledge, in Glynn, S. M. and Duit, R., (ed.), Learning science in the schools: Research reforming practice, Routledge, pp. 327–346.
  83. Tversky A. and Kahneman D., (1996), On the reality of cognitive illusions, Psychol. Rev., 103(3), 582–591.
  84. Villafañe S. M., Garcia C. A. and Lewis J. E., (2014), Exploring diverse students' trends in chemistry self-efficacy throughout a semester of college-level preparatory chemistry, Chem. Educ. Res. Pract., 15(2), 114–127.
  85. Watson S. W., Dubrovskiy A. V. and Peters M. L., (2020), Increasing chemistry students’ knowledge, confidence, and conceptual understanding of pH using a collaborative computer pH simulation, Chem. Educ. Res. Pract., 21(2), 528–535.
  86. Webb J. M., Stock W. A. and McCarthy M. T., (1994), The Effects of Feedback Timing on Learning Facts: The Role of Response Confidence, Contemp. Educ. Psychol., 19(3), 251–265.
  87. Worcester R. M. and Burns T. R., (1975), A statistical examination of the relative precision of verbal scales, J. Market Res. Soc., 17(3), 181–197.
  88. Wren D. and Barbera J., (2013), Gathering evidence for validity during the design, development, and qualitative evaluation of thermochemistry concept inventory items, J. Chem. Educ., 90(12), 1590–1601.
  89. Yates J. F., (1990), Judgment and decision making, Englewood Cliffs, NJ: Prentice-Hall.
  90. Zimmerman B. J., (2000), Self-Efficacy: An Essential Motive to Learn, Contemp. Educ. Psychol., 25(1), 82–91.
  91. Zimmerman J., Broder P. K., Shaughnessy J. J. and Underwood B. J., (1977), A recognition test of vocabulary using signal-detection measures, and some correlates of word and nonword recognition, Intelligence, 1(1), 5–31.
