Einat Ben-Eliyahu and Elon Langbeheim*
School of Education, Ben-Gurion University of the Negev, Beer-Sheva, Israel. E-mail: einatbe@post.bgu.ac.il; elonlang@bgu.ac.il
First published on 20th December 2025
The design of assessments shapes what we learn about students’ conceptual understanding. In chemistry education, visual representations are important components of learning and assessment. To examine the role of representations in assessing students’ reasoning about chemical and physical changes, we developed two multiple-choice questionnaires: one presenting the answer options in verbal form and one presenting isomorphic options as pictures. The questionnaires included a second-tier rating scale addressing the general comprehensibility of the questions and the clarity of the pictures. The questionnaires were distributed among 374 eighth graders in two phases. In the first phase we found that, on average, students performed slightly better on the verbal version; some verbal items were significantly easier than their visual counterparts, but one item showed the opposite trend. Interviews revealed that visual representations exposed a wider range of ideas among respondents and, in some cases, created confusion. The second phase focused on the visual version and revealed significant positive correlations between clarity judgements and performance on questions with visuals such as graphs representing the change in mass and molecular structures representing a chemical formula, and no correlations on others. The analysis of the interviews, together with the clarity ratings, indicates that in these questions visuals can be conceived as an additional layer of challenge, while other questions entail conceptual misunderstandings that are either exposed or concealed by cues in the external, visual layer.
Representations – whether visual or verbal – are “external”, since they refer to core concepts or “referents” (Rau, 2017) such as interparticle interactions, conservation laws or terms such as “reactants” or “mixtures”. Scientific competence is based on understanding these concepts, but since concepts are related to representations, it is also crucial to be competent in using representations (Kozma and Russell, 2005). Representational competence (RC) is demonstrated by the ability to identify and analyze key features of a representation or to select an alternative representation that conveys the same information as the original (Daniel et al., 2018; Küchemann et al., 2021). Furthermore, the visual representations themselves often include symbolic components that require prior acquaintance or specific explication (Rau, 2017; Tonyali et al., 2023). Thus, developing competence in a scientific domain involves learning to relate representational features to domain-relevant concepts. For example, assessments that present visuals of the shells and electron occupancy depicted in the Bohr model and ask students to infer the identity of the atom (Rau, 2015) evaluate both students’ competence in applying core chemistry concepts and their representational competence (Gkitzia et al., 2020; Ralph and Lewis, 2020).
Misconceived reasoning can reflect either naïve or intuitive ideas or fragmented knowledge. According to Vosniadou's (2013) framework theory, when students learn new scientific ideas (e.g., conservation of mass in chemical reactions), they attempt to integrate this new information into their existing knowledge structures, and thereby may develop fragmented, “synthetic” concepts. In our case, fragmented knowledge is evident when students state that “matter is conserved in chemical reactions” but fail to apply conservation when the reaction produces gas that seems to “disappear”, even though the system is closed. Synthetic models differ from “misconceptions” since the latter are held with high confidence, while synthetic ideas appear less coherent and students are less confident in stating them (Planinic et al., 2006). One way to identify such fragmented knowledge is by using isomorphic versions of questions that address the same concepts and share the same solution processes (Simon and Hayes, 1976). Although their conceptual core is the same, isomorphic questions use different representations for the choice options. For example, when assessing the concept of Newton's 3rd law in Susac et al. (2023), one version included images of two cars that collide and the arrows representing the forces the cars exert, whereas the graph version addressed the same collision scenario, but did not illustrate the cars, nor the direction of the forces. Susac et al. (2023) did not find significant differences in performance between the visual, verbal or graphical versions, but another study found that performance on items that represent forces using vectors was lower than that on isomorphic items representing forces using bar charts or verbal descriptions (Nieminen et al., 2010). Such variations in performance due to changes in representations indicate either fragmented knowledge or limited representational competence.
The inclusion of pictures of familiar situations in test items can reduce response time and improve accuracy, particularly when the task requires applying relationships between elements, especially among younger children (Sass et al., 2012). However, this effect is not universal: in more complex items with less familiar visuals, additional pictorial information may increase processing time, and the benefits depend on factors such as task complexity and the familiarity of the picture. There are several examples where features of visuals in assessments can mislead students (e.g., Kozma et al., 2000) and decrease performance (Ralph and Lewis, 2020). For example, Bruner et al. (1966) showed that children's predictions of the amount of liquid after pouring from narrow to wide beakers were significantly better when the beakers were covered with a screen than when they were on display. They explained that children tend to rely on misleading features of visual displays – the height of the water in a narrow/wide beaker – rather than on the logic of conservation, and defined this tendency as ‘perceptual seduction’ (Bruner et al., 1966). Thus, the inclusion of visual representations in assessments has a potentially important role in revealing fragmented conceptual understandings. To conclude, responses to assessment items reflect both (mis)understanding of the conceptual core and (mis)understanding of the representations that refer to it.
Despite extensive use of visual representations in science education, studies of their role in assessments of basic chemistry concepts are limited (Ralph and Lewis, 2020; Langbeheim et al., 2022, 2023). But even within this limited context, findings reveal that visual representations have mixed influences on performance. Items asking middle schoolers to choose a visual molecular representation that matches a macro phenomenon were generally easier than verbal descriptions of the same molecular mechanism (Langbeheim et al., 2023). However, college students’ performance on questions that addressed mole ratios was lower when visual, particulate representations replaced verbal statements with symbolic representations (Ralph and Lewis, 2020). Furthermore, when relating symbolic representations to visual ones (e.g., selecting molecular pictures that represent a chemical formula), the distracting visual options increase the difficulty of the items compared to items where the picture is shown and the selection is between symbolic options (Lin et al., 2016; Gkitzia et al., 2020). Does this “asymmetric” gap in performance indicate fragmented knowledge of core concepts, or limited representational competence? These mixed results, and the open questions regarding the role of visual representations in assessing student understanding of chemical reactions, especially among young learners, require further investigation.
Another way to interpret perceived difficulty/complexity ratings of learning or assessment tasks is through the lens of cognitive load theory. Cognitive load theory (CLT) examines cognitive processing by differentiating intrinsic load, determined by the complexity of the content vis-à-vis the level of the students, from extraneous load, which is related to the presentation of information in the instructional or assessment design (Sweller et al., 2011; Hoch et al., 2023). For example, when instructional visuals are too detailed or contain redundant or irrelevant information, they induce extraneous load and may hinder performance (Butcher, 2006; Joo et al., 2021). To conclude, including visual representations in multiple choice chemistry assessments may impart extraneous cognitive load or may reveal flawed or fragmented reasoning. However, no study has examined whether judgements of the question content and visuals can be used to separate challenges related to the conceptual core of the questions from those related to the cognitive load imparted by the details of the representations.
We view the responses to verbal and visual items as reflecting two potential sources of difficulty: conceptual understanding and representational issues that affect the comprehension of the question. To disentangle the two, we used both interviews in which students explained their reasoning and a rating scale in which they assessed the comprehensibility of the question and of the visuals in the response options. Unlike confidence rating scales, which address the entire response process (comprehending the question, interpreting the visuals, and applying conceptual knowledge) as one combined judgment, the clarity-of-question scale focuses specifically on students’ perceptions of how clear or comprehensible the item is. By separating students’ perceptions of clarity from answer correctness, the scale allowed us to identify the source of errors more precisely: items rated as clear by many students who nonetheless responded incorrectly suggest that students understood the representation but struggled with the underlying concepts. Clarity ratings that correlate with performance suggest that the question's representation imposed additional difficulty, due to low representational competence or overly complex visuals. Thus, the clarity scale can help determine whether students’ difficulties arise from conceptual misunderstanding, representational challenges, or a combination of both. Triangulating the correlational analysis with interview data helped us determine the sources of difficulty in different items.
2. How are 8th graders’ clarity ratings (of questions and illustrations) correlated with their performance on specific items, and what are the aspects that characterize items with significant correlations?
3. What are the main reasons for large performance gaps between verbal and visual versions, and the correlations between perceived clarity ratings and performance in some of the assessment items, according to students’ reasoning in interviews?
Phase 2: We re-administered the visual version of the questionnaire, this time with an additional rating scale specifically addressing the clarity of the illustrations in the response options (see Fig. 2). This addition allowed us to distinguish between the comprehensibility of the visual representations and the clarity of the question prompt, which pertained to its conceptual core.
In developing the assessment items, we sought to represent the same ideas visually and verbally in the response choices but acknowledge that the two formats cannot be entirely equivalent. We use the term isomorphic to indicate that each format addressed the same underlying conceptual framework, consistent with other studies comparing performance across representational formats (Kohl and Finkelstein, 2005; Nieminen et al., 2010; Susac et al., 2023). As is common in chemistry assessments, some items asked students to relate a macro-level representation to a molecular or particle-level one, while others presented symbolic chemical equations and asked for the appropriate submicro/nano representation (Gkitzia et al., 2020). In this study, items were classified according to the level of representation addressed in the response options. Macro-level items depicted observable, everyday phenomena (e.g., boiling water and melting butter). Nano-level items involved submicroscopic representations of particles and molecular structures (e.g., diagrams of particle motion or spacing in different states of matter). Symbolic items presented chemical equations that required interpreting the reactants and products in the conventional chemical notation. This classification follows common frameworks in chemistry education research (e.g., Johnstone, 1991; Gkitzia et al., 2020).
In each isomorphic pair, the stem of the question, which contained text and sometimes also a picture, was identical across the verbal and visual formats. Fig. 1 illustrates such an item: the stem presents a description and picture of the decomposition of mercuric oxide. In the visual format, the prompt asked: “Which of the following illustrations correctly depicts the mass of the products at the end of the experiment?” followed by four pictorial options comparing the mass of products to that of the reactants. The verbal format presented the same stem but differed in the prompt (“What is correct to say about the total mass of the products?”), with response options expressed in text.
Fig. 1 Decomposition of mercuric oxide by heating – an item requiring students to apply conservation of mass. Visual version (top) and verbal version (bottom).
The questionnaires were based on the chemistry curriculum for grades 7 and 8 in Israel, which addresses the law of conservation of mass, phase change, diffusion, mixing, and chemical reactions. Each questionnaire initially consisted of 15 items, some of which were derived from previous research (Hadenfeldt et al., 2016; Langbeheim et al., 2023) and others (e.g., the NH3 item, Fig. 4) from national tests for the 8th grade. Both visual and verbal versions of the items were reviewed for evidence in support of content validity by a panel of science teachers with over 10 years of teaching experience and three experts in science education. The reviewers were asked to verify the alignment between the verbal and visual response options and to comment on the quality of the items in terms of their alignment with 8th-grade curricular expectations. After the review, three items were removed because they required knowledge beyond the 8th-grade curriculum, reducing the questionnaire to a total of 12 items. A detailed table specifying the source of each questionnaire item is provided in Appendix A4.
To assess the comprehensibility of the items, we added a “clarity” rating scale after each item. The clarity scale ranged from 1 (unclear) to 5 (very clear), as shown in Fig. 2.
Fig. 2 Clarity rating scales used to evaluate both questions and illustrations (1 = unclear, 5 = very clear).
Phase 1: The verbal and visual questionnaires were administered to 200 eighth-grade students from schools with similar socioeconomic backgrounds. The questionnaires were distributed online using the Qualtrics platform. Each class responded to either the verbal or the visual version. To ensure comparability of the student samples across formats, the survey included background rating items about the frequency of use of different teaching practices (e.g., “My teacher uses computer simulations to show the movement of particles”). The scale ranged from 1 (“almost every lesson”) to 5 (“never”). Entire classes were assigned to one questionnaire format, so that all students within a class received either the verbal or the visual version. The assignment of classes to questionnaire formats was random and not based on students’ prior achievement, nor socioeconomic background. Socioeconomic similarity across schools was determined based on the Ministry of Education classifications of school catchment areas.
Phase 2: The revised visual version (with an additional clarity scale for illustrations) was administered to 174 eighth-grade students from two schools. The rationale for Phase 2 was to disentangle potential sources of difficulty in the visual format. In Phase 1, the clarity scale referred to the question as a whole, making it impossible to distinguish whether low clarity ratings reflected the phrasing of the prompt or the comprehensibility of the illustration. By adding a separate rating scale in Phase 2, we were able to differentiate between these two aspects and to identify whether students’ difficulties stemmed from the wording of the item or from interpreting the visual representation. This provided additional insight into the specific role of illustrations in shaping students' performance. The assessments were administered during regular science classes at school, under direct supervision of one of the researchers. Students completed the Qualtrics test individually on a voluntary basis; no incentives or grade implications were involved. Prior to participation, the importance of the study was explained, and students were encouraged to provide thoughtful responses. Students who did not wish to participate were given an alternative quiet classroom activity prepared by the teacher.
The datasets from phases 1 and 2 were analyzed separately. Phase 2 was conducted almost a year after phase 1 and included slight revisions to the questionnaire: items with low reliability from phase 1 were removed and clarity-rating scales were added. Because of these changes, the datasets were not merged. Phase 1 results guided the selection of questions for the qualitative interviews.
Independent-samples t-tests were used to compare average total scores and average clarity ratings between the verbal and visual versions. In addition, non-parametric tests were used to compare performance and clarity ratings for individual items. Finally, gamma correlations between clarity ratings and performance were examined both for total scores and for individual items. Fig. 3 is a scatter plot of performance vs. the clarity of the illustrations (Pic_clarity) for two items. Each point represents the result for a student who assigned the clarity rating shown on the x axis. The scatter plots include a small amount of random “jitter” in the position of data points to prevent them from overlapping. A score of 1 on the vertical axis represents a correct response, and a score of zero an incorrect response.
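As a minimal sketch of these analyses (using synthetic placeholder data rather than the study's actual values, and assuming scipy and matplotlib are available), the following Python snippet illustrates an independent-samples t-test on total scores, a Mann–Whitney comparison for a single item, and a jittered scatter of correctness against clarity ratings.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic placeholder data: per-student total scores (proportion correct)
visual_scores = rng.normal(0.44, 0.23, 106)   # visual-version sample
verbal_scores = rng.normal(0.48, 0.23, 94)    # verbal-version sample

# Independent-samples t-test comparing average total scores across versions
t_stat, t_p = stats.ttest_ind(visual_scores, verbal_scores)
print(f"t = {t_stat:.2f}, p = {t_p:.3f}")

# Non-parametric comparison (Mann-Whitney U) for a single item,
# coded 1 = correct, 0 = incorrect
item_visual = rng.integers(0, 2, 106)
item_verbal = rng.integers(0, 2, 94)
u_stat, u_p = stats.mannwhitneyu(item_visual, item_verbal, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {u_p:.3f}")

# Jittered scatter: correctness (0/1) vs. illustration clarity rating (1-5)
clarity = rng.integers(1, 6, 106)              # ratings on the 1-5 scale
correct = rng.integers(0, 2, 106)              # 1 = correct, 0 = incorrect
x_jitter = clarity + rng.uniform(-0.15, 0.15, clarity.size)
y_jitter = correct + rng.uniform(-0.05, 0.05, correct.size)
plt.scatter(x_jitter, y_jitter, alpha=0.6)
plt.xlabel("Illustration clarity rating (1 = unclear, 5 = very clear)")
plt.ylabel("Response (1 = correct, 0 = incorrect)")
plt.show()
```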
As shown in Fig. 3, in the CO2 item there is a clear increase in correct responses as clarity ratings rise, whereas in the Hg item no such trend is observed. Two separate gamma correlations were computed for each item from the phase 2 dataset: one between students’ performance (1 = correct, 0 = incorrect) and the perceived clarity of the question (rating 1 to 5), and another between performance and the perceived clarity of the accompanying illustration. Gamma correlations are nonparametric measures of association used when both variables are ordinal, but the distances between categories are not necessarily equal—as in our case (Goodman and Kruskal, 1954).
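For transparency, a minimal sketch of the Goodman–Kruskal gamma statistic is shown below, assuming the standard concordant/discordant-pair definition (Goodman and Kruskal, 1954); significance testing (e.g., via a permutation test or a z approximation) is omitted, and the example values are illustrative rather than taken from the data.

```python
import numpy as np

def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma for two ordinal variables.

    gamma = (C - D) / (C + D), where C and D count concordant and
    discordant pairs; pairs tied on either variable are ignored.
    """
    x, y = np.asarray(x), np.asarray(y)
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    if concordant + discordant == 0:
        return float("nan")
    return (concordant - discordant) / (concordant + discordant)

# Illustrative placeholder values: clarity ratings (1-5) and correctness (0/1)
clarity = [5, 4, 2, 3, 1, 5, 2, 4]
correct = [1, 1, 0, 1, 0, 1, 0, 0]
print(round(goodman_kruskal_gamma(clarity, correct), 2))
```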
| | Visual (N = 106) | Verbal (N = 94) | Difference (sig.) |
|---|---|---|---|
| Clarity mean (SD) | 3.27 (0.98) | 3.44 (0.86) | t = 1.31, p = 0.26 |
| Performance mean (SD) | 0.437 (0.23) | 0.482 (0.23) | t = 1.35, p = 0.177 |
In phase 2, the average clarity ratings of the visual version were slightly higher than those of phase 1 (M = 3.39) and were higher than the clarity ratings of the illustrations (M = 3.15). The correlation between clarity and performance was similar to that of phase 1, (r = 0.32, p = 0.012) and the correlation between the comprehensibility of the illustrations and overall accuracy was lower (r = 0.22, p = 0.048). Unsurprisingly, the clarity ratings of the questions, and of the illustrations, were highly correlated with each other (r = 0.77, p < 0.001).
While overall averages of clarity ratings and performance provide a general overview, item-level analyses revealed how clarity and performance patterns differed across individual items and representation types.
Table 2 shows the average performance (percentage of correct responses) and the average clarity of the items, and the differences between formats, both in terms of performance and in terms of perceived clarity. The NH3, butter and mercury oxide items show a significant difference in performance (p < 0.001, p = 0.009 and p = 0.011, respectively), with the verbal version yielding substantially higher accuracy than the visual version in the Hg and NH3 items and lower in the melting butter item.
| Item | Proportion correct (visual) | Proportion correct (verbal) | Sig. (Mann–Whitney) | Mean clarity (SD), visual | Mean clarity (SD), verbal | Sig. (Mann–Whitney) |
|---|---|---|---|---|---|---|
| Balloon (macro to nano) | 0.75 | 0.681 | 0.287 | 3.564 (1.268) | 3.690 (1.103) | 0.474 |
| MgCl (symbolic to macro) | 0.543 | 0.484 | 0.408 | 3.250 (1.266) | 3.229 (1.151) | 0.907 |
| CO2 (symbolic to nano) | 0.472 | 0.462 | 0.896 | 3.383 (1.201) | 3.277 (1.193) | 0.558 |
| Salt (conservation of matter − graph) | 0.543 | 0.677 | 0.054* | 3.370 (1.512) | 3.553 (1.150) | 0.381 |
| NH3 (symbolic to nano) | 0.295 | 0.670 | <0.001** | 3.253 (1.235) | 3.138 (1.241) | 0.537 |
| Flower (macro to nano) | 0.387 | 0.383 | 0.956 | 3.484 (1.138) | 3.471 (1.228) | 0.943 |
| Tea (macro to nano) | 0.481 | 0.468 | 0.854 | 2.830 (1.234) | 3.694 (1.155) | <0.001** |
| Bubble (macro to nano) | 0.267 | 0.234 | 0.597 | 2.912 (1.180) | 3.671 (1.106) | <0.001** |
| Butter (macro to nano) | 0.358 | 0.196 | 0.009** | 3.292 (1.205) | 3.581 (1.132) | 0.097* |
| Hg (conservation of matter) | 0.324 | 0.489 | 0.012** | 3.056 (1.309) | 3.012 (1.171) | 0.818 |
| Metal ball (conservation of matter − graph) | 0.533 | 0.596 | 0.377 | 3.222 (1.130) | 3.409 (1.283) | 0.304 |
| Egg shells (conservation of matter + graph) | 0.365 | 0.473 | 0.127 | 2.807 (1.153) | 3.256 (1.160) | 0.011** |
In addition, the tea, bubble and egg shell items exhibited significant differences in clarity ratings, with clearer verbal versions than the visual ones, but the average performance on these items was similar. Half of the items (e.g., metal ball, balloon, and MgCl) did not differ significantly on either measure; this indicates that, overall, the representation format affected only a subset of items.
Table 3 summarizes the relationship between students’ accuracy and the clarity ratings of the questions and of the illustrations, addressing the 2nd research question: how are clarity ratings (of questions and illustrations) correlated with performance on specific items, and what are the aspects that characterize items with significant correlations? The clarity ratings resembled those of phase 1, with the bubble, butter and flower items rated as relatively clear questions, despite low performance on these items. Gamma correlations were used to relate question clarity (Table 3 – left) and illustration clarity (Table 3 – right) to students’ accuracy.
| Item | Question clarity, mean | Question clarity, gamma | Question clarity, p value | Illustration clarity, mean | Illustration clarity, gamma | Illustration clarity, p value |
|---|---|---|---|---|---|---|
| Ball (heat) | 3.68 | 0.22 | 0.06* | 3.78 | 0.19 | 0.10* |
| Balloon | 3.08 | 0.15 | 0.28 | 3.32 | 0.04 | 0.79 |
| Bubble | 3.30 | −0.01 | 0.96 | 3.58 | 0.02 | 0.89 |
| Butter | 2.82 | 0.38 | <0.001** | 3.16 | 0.19 | 0.14 |
| CO2 | 3.10 | 0.21 | 0.08* | 2.81 | 0.29 | 0.02** |
| Egg shells | 3.38 | 0.24 | 0.03** | 3.24 | 0.47 | <0.001** |
| Flower | 3.66 | 0.11 | 0.39 | 3.28 | 0.04 | 0.79 |
| Hg | 3.53 | 0.16 | 0.24 | 2.89 | 0.08 | 0.53 |
| MgCl | 3.49 | 0.11 | 0.31 | 2.89 | 0.05 | 0.66 |
| NH3 | 2.89 | 0.32 | 0.01** | 3.03 | 0.41 | <0.001** |
| Salt | 3.33 | 0.41 | <0.001** | 3.27 | 0.22 | 0.07* |
| Tea | 3.1 | 0.29 | 0.024** | 2.81 | 0.07 | 0.58 |
The clarity of the illustrations was significantly correlated with performance only for the CO2 (p = 0.017), NH3 (p = 0.003), and egg shells in vinegar (p = 0.0002) items, while the dissolving salt (p = 0.07) and metal ball (p = 0.10) items approached significance. These items asked students either to relate a chemical formula to a molecular illustration (CO2, NH3) or to select a graph of the change in mass during a chemical or physical process (metal ball, salt and egg shells). In these two contexts, the perceived clarity of the illustration seemed to be related to students’ performance.
As for the clarity-of-question ratings, the salt (γ = 0.41; p < 0.001), tea (γ = 0.29; p = 0.024), NH3 (γ = 0.32; p = 0.012), butter (γ = 0.38; p = 0.004), and egg shells (γ = 0.29; p = 0.005) items demonstrated significant correlations with performance, while the CO2 (γ = 0.21; p = 0.078) and metal ball (γ = 0.22; p = 0.06) items were marginally significant. In two items – dissolving tea and melting butter – performance was correlated with the clarity of the question but not with the clarity of the illustration, indicating that the difficulty experienced by some of the students stemmed from understanding the scenario as a whole rather than from decoding the image. In half of the items (e.g., bubble, Hg, MgCl, balloon, and flower) we found no correlations with either the clarity of the question or the illustration. The correlations of these items are near zero, indicating no connection between clarity ratings and performance, as shown in the right panel of Fig. 3. The large proportion of students who chose the incorrect response but perceived the question as clear implies that the external visual layer of the question was not perceived as problematic, even by students with weak conceptual understanding.
| Item | Correct (visual) | Correct (verbal) | Consistent-incorrect category (−vis → −ver) | Fragmented category (−vis → +ver or +vis → −ver) | Correct category (+vis → +ver) |
|---|---|---|---|---|---|
| Hg | 4/11 | 7/11 | “Because after they heated it, it was like it became easier… the material sort of turned into something that wasn’t really weighed. Here it's lighter and the mass went down, because it was as if it no longer sat on the bottle. It's not in the bottle.” → “The mass of the products is smaller than the mass of the mercuric oxide. It's like the substances, because they were heated, kind of evaporated or melted” (Student 2, medium achiever) | “I think it's answer B. Because, if the balloon inflates it means air went in, and then it weighs more. Because, like, if the balloon inflates it means air entered it, and then there is more.” → “I think it's answer C. The mass of the products is equal to the mass of the reactants. Because, like, if they sealed it then it didn’t go up or down, it kind of stayed the same.” (Student 4, medium-achiever) | “No material was added or removed, so there's no reason that after the experiment (the heating) the amount would decrease or increase. I think they’re supposed to be equal, like, it's the same both at the beginning and after.” → “It's like no material was added or removed, so there's no reason for their mass to increase or decrease.” (Student 1, high achiever) |
| Butter | 4/8 | 4/8 | “It's not answer B because it shows that it's a solid… The answer is that it's something in between, in my opinion. And it's also not (answer D) because that's, like, really a liquid. A. It's answer C, between solid and liquid (a gas state).”→ “Answer C: they spread out and create large spaces between them.” (Student 10, low achiever) | “Answer A (the correct answer), because the arrows show that it is liquid and spreading to the sides. Distractor B is a solid, and distractor C looks too separated since the spaces there seem large.” → “Answer B” is correct (description of the solid), because when the butter is liquid, the particles spread out more, and when it is solid they are like an atomic lattice and connected. Answer D (the correct one) is more like the middle of the process.” (Student 6, high achiever) | “Because, like, they’re not like air that spreads everywhere. They do move and switch places with each other.” → “At first (solid) they were only moving in place and were closer together, and then they started moving more and had more space between them.” (Student 8, high achiever) |
| NH3 | 2/11 | 9/11 | “I think it's answer B. Like, it seems more logical to me because here it's four H, and then the N has four H around it, and those (Cl) are single because there's only one chlorine each.” → “I think it's B: the reactants are molecules made of a nitrogen atom connected to four hydrogen atoms, and in addition, single chlorine atoms. Because again, it's basically the same thing. Here it turns into 4H, then they are connected to the N, and there's the Cl which is kind of alone.” (Student 5, low achiever) | “Answer C. Like, they connected, and it was kind of one, and then it connected with the other, and then it didn’t really break apart, and so they are joined together.” → “Answer A. Because it says it's three hydrogens and one nitrogen, and the second molecule is one chlorine and one hydrogen.” (Student 9, medium achiever) | “The chemical symbol fits both types of molecules in the reactants.” → “Because there are two types of molecules, and in the end a solid was formed – one is nitrogen with three hydrogens, and the other is chlorine with hydrogen.” (Student 1, high achiever) |
Table 4 summarizes students’ reasoning patterns across visual and verbal formats. The middle column, under the ‘fragmented’ category, presents instances where students changed their explanations when the question format shifted from visual to verbal (or vice versa). It should be noted that consistency or fragmentation is not a stable attribute of particular students but depends on the representational context; for instance, a student might reason consistently about the mercury oxide item but show fragmented reasoning in the NH3 question. Some association between students’ achievement levels and their reasoning patterns was observed. For example, the three students classified as high achievers answered both verbal and visual versions of the mercury-oxide and butter items correctly, and two of them answered both versions of the NH3 item correctly. Middle- and lower-achieving students more often displayed fragmented or synthetic reasoning. These differences suggest that general academic achievement may contribute to the stability and coherence of students’ conceptual reasoning, although representational format still played a central role in shaping their responses.
The NH3 + HCl reaction item (Fig. 4) showed a significant correlation between the perceived clarity of the visuals and accuracy (Table 3), and a significant difference between verbal and visual versions, with 29.5% of students choosing the correct answer in the visual condition compared to 67% in the verbal one (see Table 2). Similarly, in the interviews, only 2 of 11 students responded correctly on the visual version, compared to 9 of 11 on the verbal version. The most common incorrect distractor for the visual item was a representation of the product of the reaction: two NH4Cl molecules in a solid state (option D in Fig. 4). Students who answered correctly on the verbal version but incorrectly on the visual one revealed in the interviews that they knew that the reactants are shown on the left side of the chemical equation (“It's NH3; it has three hydrogens and HCl” – Student 11) but still chose the distractor describing the product “because the gases combined into a solid; they are connected like a solid” (Student 7). It seems that for these students, the details of the pictures representing the combined product and the state symbol (s) in the chemical equation created a confusion between reactants and products. In other words, they viewed the picture of NH4Cl as the molecular manifestation of the expression NH3 + HCl. This type of confusion can be interpreted as evidence of either cognitive load imparted by the visuals or limited representational competence that is elicited by the visual version.
The item concerning the decomposition of mercury oxide (Fig. 1) exhibited no correlation between perceived clarity and performance (Table 3), with 48.9% choosing correctly on the verbal version and only 32.4% on the visual one (Table 2). This item illustrates the chemical reaction of mercury oxide powder that decomposes into mercury, which adheres to the flask's walls, and gaseous oxygen that inflates a balloon. Similar to the larger sample, most interviewees (7 of 11) chose the correct response (that mass is conserved) on the verbal version, compared to only 4 of 11 on the visual one. Three of the interviewees demonstrated fragmented concepts: while in the verbal version they referred to the conservation of mass correctly, in the visual version the additional details of the inflated balloon triggered the idea that the inflation of the balloon caused either an increase or a decrease in mass (the verbal format did not mention the inflated balloon). These students chose one of the pictures of the unbalanced scales, with the flask of the reactants and a deflated balloon on one side and the same vessel with the products and an inflated balloon on the other. The interviews revealed that the image of the inflated balloon elicited the idea that the mass had either decreased or increased, as stated by student 4: “I think it's answer B. Because, like, if the balloon inflates it means air went in, and then it weighs more….”, while in the verbal version she explained: “I think that the mass of the products is equal to the mass of the reactants. Because, like, if they sealed it then it didn’t go up or down, it kind of stayed the same.” (Student 4). Similarly, student 6 acknowledged the balloon in her response to the visual version: “I think the total mass is smaller because the balloon inflated; it turned into gas. So, because they heated it, it became lighter”, while in the verbal version she chose the correct response and said: “It (the mass) might be equal because they just heated it and heating it doesn't make the mass heavier”. Her justification for the correct response seems to misinterpret the chemical process as mere heating that does not change the nature of the material, again revealing fragmented knowledge.
In the melting butter item (Fig. 5), students performed significantly better on the visual version (35.8% correct) than on the verbal one (19.6% correct), and the clarity of the item was correlated with students’ performance (Table 3). The relatively low performance on this item indicates limited understanding of particle motion and configuration during phase change. Unlike the Hg and NH3 questions, it also shows that pictorial options can support and streamline students’ reasoning. The interviews reveal that one reason for the difference between versions is the ambiguity regarding particle configuration in the melted butter, which is evident in the verbal statements regarding interparticle distances. For example, many students (55.2%) chose the incorrect option B: “Butter particles move and spread out, creating large spaces between them”, although large distances between particles characterize gases, not liquids. The large distances between particles were more salient in the visual representation, and so most students rejected option C, which represented particles spread out as in a gas; only 23% of the participants selected this distractor. For example, student 6 stated: “option C looks too separated, since the spaces there seem too large” (Table 4). However, the interviews also reveal some hesitation regarding the actual phase of the butter that is depicted in the question scenario. This is evident in the reasoning of the same interviewee regarding the verbal version: “It's either B or D. It's probably B because D is more like the middle of the process. When the butter is liquid, they [the particles] spread out more, and when it is solid, they're like in an atomic lattice and connected. Now [when it is liquid] they have more space and can move. Answer D [the correct one] represents the middle of the process; where it is not liquid yet.” This excerpt shows that she knew the differences between solid and liquid, but the vagueness of the verbal distractors created a hesitation regarding which of the two options is a better representation of the phase of the macroscopic butter depicted in the question. This hesitation may indicate that the scenario of the question was unclear, leading to low performance, and perhaps explains the significant correlation between question clarity and performance.
Slightly higher performance on verbal representations was also reported in the context of introductory university physics courses, where visual items (especially those that contain unfamiliar symbols such as vectors) were more cognitively demanding (Meltzer, 2005; Nieminen et al., 2010; Susac et al., 2023). Still, the populations of these studies were university or high school students, whereas we focus on eighth graders, a significantly younger cohort. University and high school students have more experience in interpreting visual representations and making connections between different forms of representations than middle schoolers. Consequently, the smaller differences between verbal and visual versions among older students are probably related to their greater experience with representations, i.e., their representational competence (Garcia-Mila et al., 2014).
Still, for almost half of the items, the correlations between clarity and performance were very low. One explanation for this is that the middle school students in this study often overlooked details that were pertinent to the question scenario and still believed they understood it; hence the higher-than-average ratings even among low performers. This is similar to the “meta-ignorance” described by Brandriet and Bretz (2014) – an unwarranted sense of confidence among students with low scores. For example, in the bubble item, despite high clarity ratings, students commonly believed that the bubbles in boiling water are made of air (rather than water vapor), demonstrating a persistent misconceived idea (e.g., Osborne and Cosgrove, 1983; Johnson, 1998). Similarly, Potgieter et al. (2010) showed that confidence judgments of items with (incorrect) distractors that most students found very convincing did not correlate with performance. Thus, the low correlations between clarity and performance on these items stem from students who misunderstood the concepts and performed poorly, yet believed they understood the questions and the visuals rather well.
In a couple of cases, such as the melting butter item (Fig. 5), performance on the visual version was better than on the verbal one, although the clarity ratings of the verbal version were higher. On the verbal version, most students selected the option that stated that particles “spread out” and occupy larger spaces in the liquid; about a quarter of students viewed the particles of the liquid as “soft and moving like jelly.” Both descriptions reflect a mental model that endows individual particles with resemblance to the macroscopic appearance of the material (Johnson, 1998). The visual format helped students reject some of these ideas when the representations revealed that large spaces resembled a model of a gas rather than a liquid, or when the shape of jelly-like particles seemed too odd to represent particles. This demonstrates how visual representations cued information that “masked” fragmented knowledge. The butter item shows that visual representations can also “flag” problematic answers so that children avoid selecting them, thereby superficially improving performance. Interestingly, this item also exhibited a correlation between students’ performance and their clarity ratings of the question – indicating that students who selected the incorrect responses also rated the clarity of the question lower. This may indicate that the difficulty for low performers was not rooted in the pictures of the choice options, but rather in the scenario of the question as a whole.
Finally, the interview data corroborated the finding that awareness of the complexity of visuals may reflect cognitive load, as in the NH3 item. In this case, many of the students knew that the reactants are the materials shown on the left side of the chemical reaction and the products on its right, but when confronted with the detailed molecular, space-filling pictures alongside the chemical equation, they were confused by the abundant information and chose the wrong answer. This shows that the cognitive demand created by the visuals might have blurred students’ ability to apply the core concepts, which explains the significantly lower performance on the visual version of this item compared to the verbal one. These findings can also clarify the reciprocal yet distinct relationship between conceptual understanding and representational competence (Kozma and Russell, 2005; Rau, 2017).
To conclude, visual representations can be viewed as the external layer mediating activation of the core concepts, and clarity ratings serve as a diagnostic lens for this mediation. The external layer consists of pictorial and symbolic cues that activate students’ core concepts. For instance, in the mercury oxide item, the image of the inflated balloon drew students’ attention away from the conservation principle, illustrating how the visual layer can activate conceptual misunderstanding. When students understand the core concept but struggle with the external layer, their clarity ratings are directly tied to the accuracy of their responses (as in the NH3 item). When the external, visual layer exposes flawed reasoning that students are not aware of, clarity ratings are detached from performance (as in the Hg item). And when perceptual cues in the visual layer help some of the students avoid incorrect choice options, clarity ratings of the question are correlated with performance, but the clarity of the illustrations is not (as in the melting butter item). Thus, visual representations can expose, mask, or overload conceptual reasoning, and clarity ratings help distinguish between these modes of interaction.
Distinguishing students’ reactions to external representations and their understanding of core concepts resembles the separation between conceptual sense-making and perceptual fluency in learning with multiple external representations (Rau et al., 2015). Rau (2015) showed that students who learn by building their perceptual fluency via seemingly superficial engagement with representations improve their chemistry knowledge but only when they have prior conceptual knowledge. This too supports the layered view: perceptual learning with visual representations is based on an initial layer of conceptual knowledge, and in order to learn from external representations, students need an initial knowledge structure for noticing variations and patterns in these representations (Bransford and Schwartz, 1999; Rau, 2015). Our study adds that visual representations can reduce performance compared to the verbal version either because they expose conceptual misunderstandings or because they blur the clarity of the question, due to limited representational competence. The two influences can be distinguished using an analysis of clarity ratings: flawed performance due to limited representational competence or fluency is characterized by correlations between question clarity and performance and cases of fragmented conceptual understanding are not correlated with clarity. In rare cases, visual representations can also mask conceptual misunderstanding and increase performance, when familiar images flag incorrect distractors.
In addition, responses to visual items should be examined for reliance on superficial details. When choices are driven by superficial (but familiar) pictorial cues, the item may not capture fragmented knowledge. In such cases, visuals should be replaced with verbal forms focusing on the core concept. Finally, clarity ratings can provide a lens for distinguishing conceptual understanding and representational demands, supporting a layered view of competence that combines conceptual understanding with representational fluency.