Use of machine learning to analyze chemistry card sort tasks

Logan Sizemore *a, Brian Hutchinson ac and Emily Borda b
aDepartment of Computer Science, Western Washington University, Bellingham, WA, USA. E-mail: sizemol@wwu.edu; Brian.Hutchinson@wwu.edu
bDepartments of Chemistry and Science, Math, and Technology Education (SMATE), Western Washington University, Bellingham, WA, USA. E-mail: bordae@wwu.edu
cComputing and Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, WA 99354-1793, USA

Received 24th January 2022, Accepted 31st October 2023

First published on 7th November 2023


Abstract

Education researchers are deeply interested in understanding the way students organize their knowledge. Card sort tasks, which require students to group concepts, are one mechanism to infer a student's organizational strategy. However, the limited resolution of card sort tasks means they necessarily miss some of the nuance in a student's strategy. In this work, we propose new machine learning strategies that leverage a potentially richer source of student thinking: free-form written language justifications associated with student sorts. Using data from a university chemistry card sort task, we use vectorized representations of language and unsupervised learning techniques to generate qualitatively interpretable clusters, which can provide unique insight into how students organize their knowledge. We compared these to machine learning analysis of the students’ sorts themselves. Machine learning-generated clusters revealed different organizational strategies than those built into the task; for example, sorts by difficulty or even discipline. There were also many more categories generated by machine learning for what we would identify as more novice-like sorts and justifications than originally built into the task, suggesting students’ organizational strategies converge as they become more expert-like. Finally, we learned that categories generated by machine learning for students’ justifications did not always match the categories for their sorts, and these cases highlight the need for future research on students’ organizational strategies, both manually and aided by machine learning. In sum, the use of machine learning to analyze results from a card sort task has helped us gain a more nuanced understanding of students’ expertise, and demonstrates a promising tool to add to existing analytic methods for card sorts.


Introduction

Making sense of new information is dependent on one's ability to organize knowledge into a framework that facilitates its retrieval and use (Gentner and Medina, 1998; Bransford et al., 2000). Card sort tasks, in which individuals are asked to group cards representing problems by similarity, have been used as a means to understand how students organize their knowledge and investigate the development of expertise, e.g., (Chi et al., 1981).

In chemistry, a specific dimension of one's knowledge organization is the ability to connect three levels of understanding or representation described in Johnstone's (1982) chemistry triplet: submicroscopic, macroscopic, and symbolic. This triplet was originally proposed to expose the difficulty that linking such levels of representation and understanding poses to a novice, and to establish that connections between these levels require instructional support. In an earlier study, we developed a card sort task to investigate students’ ability to connect these three triplet levels by recognizing common principles among problems represented on different levels (Irby et al., 2016). Although, on average, the sorting data showed a positive correlation between the amount of one's formal chemistry instruction and their ability to sort by underlying principles, there were several sorts and verbal justifications for sorts that were difficult to make sense of. We wondered if there were patterns in students’ thinking that were not easily captured by this card sort and its associated analytic techniques.

Here we report the use of machine learning techniques to analyze a larger dataset captured with the same card sort task. The results from the use of machine learning to analyze both the sorts and the sorting justifications suggest novel ways of approaching the card sort task, with associated implications for students’ knowledge structures when learning chemistry. Our method also adds tools and analytic approaches to support future studies using card sorts.

Theoretical framework

In this work we use a card sort task to investigate students’ expertise in chemistry. A dimension of expertise we drew upon to design and interpret results from this task is the idea of schemas. Schemas are described as mental representations of problems created by individuals engaged in problem-solving tasks (Eysenck and Keane, 2005; Galotti, 2014). Research suggests schemas change with learning (Revlin, 2012), and that the more abstract and principle-based, vs. context-bound, the schema, the more readily an individual can apply it across problems with similar underlying principles, even if those problems might look distinct on the surface (Chen, 1999). Thus, the development of expertise involves, at least in part, the modification of schemas to become more abstract and generalizable, a process called re-representation (Gentner, 2005).

A separate but related lens through which to view expertise is that of knowledge organization. The cognitive science literature tells us expertise requires not only a breadth of factual knowledge, but also a conceptual framework that organizes the knowledge in a way that facilitates retrieval (Bransford et al., 2000). A well-defined knowledge structure facilitates long-term storage and easy retrieval of factual information (Mayer, 2012). Studies show knowledge structures become more interconnected through the acquisition of expertise (Acton et al., 1994). This finding may also relate to schemas becoming more abstract as an individual gains expertise, as discussed above, where a more interconnected knowledge structure is more abstract and generalizable. Studies performed using an organic chemistry reactions card sort task show consolidation of knowledge structures with more expertise (Galloway et al., 2018). Card sorts are also a way of investigating categories that guide one's knowledge structure. Open-ended card sort tasks have been used to surface different categories or “bins” individuals use to express similarities between problems or information represented on the cards (Yamauchi, 2005). These categories can reflect the quality of a learner's knowledge structure, in terms of how well information linked together in a category facilitates the application of that information to a range of different types of problems and tasks. For this reason, card sort studies often ask individuals to name the groups they create, and almost always have a qualitative component asking the individual to justify their groupings.

The chemistry triplet, originally proposed by Johnstone in 1982 (Johnstone, 1982), combines ideas about knowledge organization and representation in describing expertise in chemistry. As expertise grows, individuals are able to better connect the three levels of understanding and representation in the triplet: macroscopic, submicroscopic, and symbolic (Irby et al., 2016). The macroscopic level represents human-scale observations, which can be aided by instrumentation. An example of a macroscopic scale set of observations is that when table salt is added to water, it yields a clear, colorless liquid that nevertheless tastes different from pure water. In addition, this mixture conducts electricity, whereas pure water does not. The submicroscopic scale usually represents a small particle model or explanation for what is happening with molecules, atoms, and/or ions, to produce the results observed on the macroscopic scale. A submicroscopic-scale explanation would be that sodium chloride splits up into positively- and negatively-charged ions that are surrounded by water molecules, which are attracted through their partial charges to the ions. These ions can move in response to external charges, to which they are attracted or repelled. The symbolic scale expresses observations or small particle models with symbols, such as mathematical or chemical equations, or graphs. A symbolic-scale description of the dissolution of salt in water might be expressed through this chemical equation: NaCl(s) → Na+(aq) + Cl−(aq). A novice might think about the chemical equation, and whatever problem-solving tasks involve it, as separate from a drawing, simulation, or even mental representation of small particles of sodium chloride breaking up into charged ions and being surrounded by water; they might further separate both from the observations that salt dissolves in water and that the resulting salt water conducts electricity. An expert, in contrast, tacitly links the three domains of knowledge to form a knowledge structure that is explanatory, predictive, and generalizable. Studies have shown, for example, that novices can successfully work problems oriented at the symbolic level while failing to successfully work problems that integrate representations of more than one level of the chemistry triplet (Jaber and BouJaoude, 2012).

The problem of connecting components of the chemistry triplet is in part related to an individual's ability to decode representations, an activity that is not necessarily straightforward to novices (Kern et al., 2010). In addition, the process of making such connections is thought to be more than a question of representational competence, and in fact, is part of what it means to understand concepts and principles in chemistry (Johnstone, 1982; Taber, 2013). For example, one's ability to visualize sodium and chloride ions being solubilized in water while observing salt being added to water is evidence of their ability to create a model or explanation for what is going on that can be used to make predictions. Indeed, creating and using models, explanations, and arguments is central to any scientific endeavor (National Research Council, 2012) and in chemistry, necessarily involves linking these three levels of understanding/representation. In this sense, connecting the components of the chemistry triplet in the service of solving a problem, making a prediction, or constructing an explanation can be thought of as developing an abstract schema that connects two or more parts of the chemistry triplet.

Card sorts as methods for investigating expertise

Individuals’ schemas, knowledge organization, and, specifically in chemistry, their ability to integrate components of the chemistry triplet are all dimensions of expertise that are difficult to investigate through traditional knowledge inventories or surveys. This is because these dimensions focus on how students connect and use different components of their knowledge, i.e., what they do with that knowledge, not necessarily the extent to which they have acquired it. We relied on a card sort task because when students sort cards, they are demonstrating their thinking about how concepts are related, or unrelated, to each other.

Classification and card sort tasks have been widely used to investigate expertise in various domains (Chi et al., 1981; Smith, 1992; Kozma and Russell, 1997; Lajoie, 2003; McCauley et al., 2005; Domin et al., 2008; Stains and Talanquer, 2008; Lin and Singh, 2010; Mason and Singh, 2011; Smith et al., 2013; Irby et al., 2016; Krieter et al., 2016; Graulich and Bhattacharyya, 2017; Davidson et al., 2022; Lapierre et al., 2022). A finding on which much of this literature has converged is that novices tend to focus on surface features and experts focus on underlying principles and concepts when sorting cards displaying problems (Singer et al., 2012). This finding, developed in physics (Chi et al., 1981) and reproduced in biology (Nehm and Ridgway, 2011; Smith et al., 2013), chemistry (Irby et al., 2016; Krieter et al., 2016; Graulich and Bhattacharyya, 2017; Galloway et al., 2018), and geology (Davidson et al., 2022), is undergirded by research on schemas from the problem solving literature.

Card sort data are often quantitatively analyzed by how close they are to the canonical sorts upon which their design is based; for example, edit distances (Deibel et al., 2005; Smith et al., 2013) describe the smallest number of rearrangements necessary to change a sort into one or the other canonical sort, while sorting coordinates (Irby et al., 2016) describe the sort's “distance” from the canonical sorts. Subsequent researchers have used edit distance in conjunction with normalized minimum spanning trees to measure the orthogonality of sorts given by participants of an administered single-criterion card sort task (Fossum and Haller, 2005; McCauley et al., 2005). Gephi maps have also been used to graphically depict common groupings (Galloway et al., 2018; Lapierre et al., 2022). In order to find clusters of common sorts produced by participants in a card sort study, clustering techniques, including k-means clustering, have been effectively employed (Ewing et al., 2002; Bussolon, 2009; Paul, 2014; Shinde et al., 2017; Stiles-Shields et al., 2017; Carruthers et al., 2019; Gläscher et al., 2019; Macías, 2021; Paea et al., 2021). Our work describes the use of clustering on the free-form responses students gave for their card sort, which, to our knowledge, has not been done before in the context of card sort tasks.

Card sort tasks are often accompanied by interviews asking students to name and/or justify their sorts. These qualitative data are also most often analyzed by what type(s) of canonical sorts are represented. While these can be effective ways of determining whether a student organizes their knowledge around underlying principles or surface-level representations, in practice a student's knowledge and their sort are rarely purely principle- or representation-based. Any method of analysis relying solely on how close a given student's sort is to a canonical sort will lose the potentially nuanced justification for that student's sort. These types of analyses also assume students who sort alike think alike, and that there is some unique relationship between the way a student sorts and their schemas. Instead, it would be beneficial for a researcher to be able to utilize a student's language justification of their sort, a potentially richer signal of their thought processes. Leveraging the language used to justify a student's sort may provide insight into ways of organizing knowledge that combine, or fall outside of, the canonical principle or representation categories upon which the cards are based.

Although qualitative data analysis is crucial to adding the nuances described above to card sort studies, it is labor intensive. In most of the card sort studies cited here, manual coding was used to analyze qualitative data coming from participants’ verbalization or written justifications for their groupings. In some cases, codes were used to develop emergent themes for students’ sorting patterns (e.g., Stains and Talanquer, 2008; Galloway et al., 2018), while in other studies codes reflect hypothesized novice and expert sorts upon which tasks were originally created (e.g., Smith et al., 2013; Irby et al., 2016). These methods can be highly labor intensive and difficult to apply consistently, especially with large datasets, and may cause researchers to miss ways of grouping that were not initially hypothesized when the card sorts were created. As Fincher and Tenenberg (2005) highlight, there are computational techniques that can help in overcoming the challenges associated with analyzing card sort data, especially for large datasets. In this paper, we further extend this line of work by describing the use of machine learning methods to analyze card sort data. We applied these methods to a card sort task we designed to investigate students’ abilities to connect the three levels of understanding and representation in the chemistry triplet (Irby et al., 2016). In this task, general chemistry problems on three different topics are represented primarily on one of the three levels of the chemistry triplet, so that level of representation competes with the underlying principle needed to solve the problems. We hypothesized that as individuals gain expertise, they would more readily cross between levels of representation to group cards by the underlying principles needed to solve the problem. Although the quantitative evidence from the students’ sorts largely supported this hypothesis, much of the qualitative data from interviews in which participants justified their sorts were impossible to categorize into novice and expert anchor points. Thus, we used machine learning methods to uncover alternate sorting frameworks students used in order to help us investigate students’ schemas and knowledge structures in a more nuanced way, potentially aiding in the exploration of the development of expertise, which is arguably less well understood than the differences between novices and experts (Stains and Talanquer, 2008).

Research questions

The research questions guiding our study are: (1) How can machine learning be applied to help us better understand card sort data – both the sorts themselves and the justifications for the sorts, and (2) What new understandings of students’ schemas can be uncovered through the use of machine learning to analyze a chemistry card sort task crossing levels of the chemistry triplet with general chemistry principles?

Methods

In this paper, we describe a variety of unsupervised machine learning approaches to analyze students’ card sorts and their free-form written justifications using the chemistry card sort task (Irby et al., 2016). Unsupervised machine learning (Hastie et al., 2009) is a term that encompasses a collection of techniques aimed at learning from data or discovering patterns therein without explicit annotations, labels, or other guidance from a researcher. Our primary goal for analysis is to place each student in our dataset within a space such that students who sort or justify their sorts similarly are positioned close to one another, and students who sort or justify their sorts dissimilarly are positioned far from each other. We use machine learning techniques to assign each student a sequence of coordinates defining their position in a high-dimensional space. This sequence is called a vector. From these vectors we can identify and characterize potential themes. In this work we create two spaces: a “sort space” based on students’ sorts and a “language space” based on students’ justifications for their sorts. These two spaces enable us to identify separate themes related to sorts and justifications, potentially uncovering more nuanced ways students organize their knowledge around the chemistry problems. We can also begin to address whether students who sort alike also justify their sorts in a similar way, and vice versa.

Data collection methods

Our data were collected from students in general and organic chemistry courses at a medium-sized, master's-granting regional university. All students were given course credit for completing the task, and only data from those who consented to participate in the study in our IRB-approved informed consent process were included in the data set for analysis. Participants completed an online version of the chemistry card sort task, in which they were told to sort chemistry problems (cards) into different groups representing related concepts needed to solve the problems. The same general procedure and prompts described in Irby et al. (2016) were used, including the requirement that students had to create between two and eight groups (i.e., they could not put all nine cards into one group, nor could they put each card in its own individual group). Initial data from a pilot of the online task showed similar sorting patterns as in the original study done via face-to-face interviews, reinforcing the finding that students with more chemistry preparation sorted, on average, increasingly by underlying principle. The online sort task included fields for students to write a title and justification for each group they created. Responses were discarded if the student did not consent to participate in the study; gave duplicate responses; responded in fewer than five minutes; gave an incomplete response; gave inaccurate information about themselves, their instructor, or their class; or responded in a way that showed they had failed to read the card sort task instructions. In total, 1722 responses were deemed acceptable for analysis. These came from students enrolled in general or organic chemistry, as described in Table 1. Data from these students were aggregated to give a wide breadth of responses, allowing us to develop machine learning methods of analysis.
Table 1 Number of classes, unique instructors, and students for each chemistry course

Chemistry course       Classes   Unique instructors   Students
General chemistry 1    16        11                   611
General chemistry 2    10        8                    513
General chemistry 3    6         6                    402
Organic chemistry 1    4         3                    146


Card sort task

The card sort task used in this study (Irby et al., 2016) consists of nine cards, each representing a typical problem from a general chemistry course. The card set crosses three underlying principles – stoichiometry (ST), mass percent (MP), and dilution (DL) – with the three levels of representation in the chemistry triplet: macroscopic (MC), submicroscopic (SA), and symbolic (SY). Each card thus presents a unique combination of underlying principle and level of representation (Fig. 1). Participants are given these nine problems in random order and asked to sort the cards based on similar concepts one would need to solve each problem. This task was designed to have two canonical sortings, both of which consist of three groups of three cards each: a principle sort and a representation sort (the rows and columns, respectively, of Fig. 1). While participants may sort the cards any way they feel is appropriate, our previous study with this sorting task has shown that those with more chemistry preparation tend to sort the cards based on underlying principles, while those with less chemistry preparation sort their cards based on the representation of the cards or surface-level features (Irby et al., 2016).
Fig. 1 Card sort task (Irby et al., 2016). Each row corresponds to an underlying principle and each column corresponds to a representation. Underlying principles and representations are shortened to two letter identifiers. Stoichiometry (ST), mass percent (MP), dilution (DL), submicroscopic (SA), macroscopic (MC), and symbolic (SY).

Methods for clustering students’ responses

Card sort tasks can be useful assessments of students’ knowledge organization, and clustering can reveal patterns in students’ knowledge structures, potentially providing insight into their schemas. Below we describe how we generated vectorized representations of both the language students used to justify their sorts, and the sorts themselves. Unsupervised machine learning techniques then used these representations to generate interpretable clusters, or groups, of similar data.
Defining the language space. In order to cluster the language used by students, we must first preprocess the text and then translate it into a numerical vector representation. First, all of the group titles and justification text were stacked into one single document per student. This is the “natural language” data that was used to define the language space. Second, we performed some text preprocessing: we (1) corrected minor misspellings (e.g., “stochiometry” was corrected to “stoichiometry”, “mitchondria” was corrected to “mitochondria”), (2) removed most punctuation, and (3) removed casing (i.e., uppercase letters were changed to lowercase letters). Note that we opted not to stem the words (e.g., shortening “calculations” to “calculation”) because the methods we define below work best on unstemmed forms: stemming discards important cues to the role of a word in the sentence, and thus to the author's intent. Lastly, in order to cluster student justifications, the text of each justification needs to be converted into a vector representation. We therefore produce vector representations of text using a pre-trained bidirectional encoder representations from transformers (BERT) neural network model (Devlin et al., 2018). BERT begins with a learned vector representation of each word, called a “word embedding”: a vector (e.g., in a 512-dimensional space) situated in a word space where semantically and syntactically similar words occupy nearby locations. For example, the distance between the vectors representing the words “king” and “queen” would be small because both words have similar meanings or usage within the English language. In contrast, the vectors representing the words “king” and “pie” would be further apart because the words have dissimilar meanings and are often used in different contexts. Word embedding models tend to learn quite sophisticated representations; for example, if the word embedding for “man” is subtracted from the word embedding for “king” and the result is added to the word embedding for “woman”, then the resulting vector ends up similar to the word embedding for “queen” (Mikolov et al., 2013). A BERT model takes the word embedding inputs and passes them through a sequence of computational components known as encoder-style transformer blocks (Vaswani et al., 2017). Each block allows BERT to produce progressively richer representations of each word by taking into account the potentially long-range context in which that word appears. The key to this context-awareness is a computational mechanism known as “self-attention,” in which each word's representation is augmented by the representations of other relevant words in the sentence. This means that identical wordforms (e.g., “bark” from a tree and “bark” from a dog) will yield different embeddings depending on the context in which they are used. After the final transformer block, the per-word embeddings were averaged over all words in the justification, resulting in a single vector that represents the meaning of the entire student justification. The resulting justification vectors define the language space, in which responses are located based on the language used to name and justify the sorts. Later, we describe how we cluster these justifications based on their location in this language space.
We note that BERT is far from the only model capable of producing justification vectors, but we chose it because of its effectiveness, popularity, ease of use, and the availability of model weights already trained on substantial amounts of data.
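To make this step concrete, a minimal sketch of the embedding computation is shown below, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint. The paper does not prescribe a particular implementation, so the checkpoint and pooling details here are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: bert-base-uncased; any pre-trained BERT checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def justification_vector(text: str) -> torch.Tensor:
    """Embed one student's stacked titles/justifications by mean-pooling
    the final-layer token embeddings into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # out.last_hidden_state has shape (1, num_tokens, 768) for BERT-base.
    return out.last_hidden_state.mean(dim=1).squeeze(0)
```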
Defining the sort space. Like the language space, the sort space is defined using vectorized representations of a student's response. However, this time the vectorized representations capture a student's actual sort instead of their justification. In order to define the sort space, we first needed a quantitative expression of how close each participant's sort was to each canonical sort. To serve this purpose, we used a metric similar to the sorting coordinate, which is calculated using a co-occurrence matrix to determine a “distance” between two sorts, with higher numbers representing greater distances or dissimilarities between two sorts. Co-occurrence matrices have been commonly used as an analytic technique for card sort tasks (Martine and Rugg, 2005; Sanders et al., 2005). Sorting coordinates were defined in Irby et al. (2016) as a means to measure the distance between a student sort and one of the two defined canonical sorts. Here, however, we use them exclusively as a means of defining the sort space, based on distances between pairs of student sorts. We chose sorting coordinates over the other distance metrics summarized in the introduction because fewer unique sorts can share the same score.

Using sorting coordinates, we calculated the distance from each student sort to each other student sort, resulting in vectors containing 1722 integers for each of the 1722 student responses. These distances define an implicit sort space, where similar sorts occupy a similar location within that space. We can use this sort space to identify groups (clusters) with similarities we may not have uncovered through a manual sorting study.
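A minimal sketch of this construction follows. The pairwise distance shown counts card pairs that co-occur in one sort but not the other; the exact sorting-coordinate formula follows Irby et al. (2016), so this implementation should be read as an illustrative approximation rather than the authors' precise metric.

```python
import numpy as np

def cooccurrence(sort: list[set[int]], n_cards: int = 9) -> np.ndarray:
    """Binary matrix: entry (i, j) is 1 if cards i and j share a group."""
    m = np.zeros((n_cards, n_cards), dtype=int)
    for group in sort:
        for i in group:
            for j in group:
                m[i, j] = 1
    return m

def sort_distance(a: list[set[int]], b: list[set[int]], n_cards: int = 9) -> int:
    """Number of card pairs grouped together in one sort but not the other."""
    diff = cooccurrence(a, n_cards) != cooccurrence(b, n_cards)
    i, j = np.triu_indices(n_cards, k=1)  # count each unordered pair once
    return int(diff[i, j].sum())

# Each student's coordinate vector is their distance to every other student:
# X[s] = [sort_distance(sorts[s], sorts[t]) for t in range(len(sorts))]
```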

Clustering students’ responses

Once we have defined a space for both the language and the sorts students produce, we can divide up this space into clusters of student responses, where the responses in each cluster are more similar to each other than to those outside their assigned cluster.
K-Means. After defining the spaces, we used a learning algorithm called k-means (MacQueen, 1967) to identify clusters within those spaces. K-Means takes a vector representation of each data point and produces a clustering. Each cluster is defined by a “centroid,” which is the average vector of the points belonging to the cluster. An optimal clustering is one that minimizes the average distance from each point to its cluster centroid. The initial centroids are chosen randomly, and the k-means algorithm alternates between (1) assigning points to their nearest centroid, and (2) updating centroids to better represent their cluster, until convergence is reached.

We used the k-means algorithm to identify clusters of data within the language and sort spaces. The expectation is that students who used similar language to justify their sort will be assigned to the same language space cluster. Likewise, students who sorted in a similar way will belong to the same sort space cluster, allowing us to identify potential new common sorting patterns.
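A minimal sketch of this clustering step, assuming scikit-learn's KMeans and that the per-student vectors have been stacked into a matrix with one row per student:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_space(X: np.ndarray, k: int, seed: int = 0):
    """Cluster row vectors (one per student) into k groups;
    returns per-student labels and the cluster centroids."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    return km.labels_, km.cluster_centers_

# e.g., labels, centroids = cluster_space(language_vectors, k=4)
```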

Selecting the optimal number of clusters. When clustering, a classic challenge is to select the optimal number of clusters. Some useful heuristics exist; for example, trying to spot an inflection point where adding more clusters leads to diminishing returns. One popular principled approach is to use the “silhouette score”, which measures how similar a data point is to its own cluster compared with the other clusters. A low silhouette score can indicate too many or too few clusters (Rousseeuw, 1987). In this work we attempt to select the number of clusters for our analysis based on a combination of silhouette score and qualitative judgement of the resulting clusters. In cases where the optimal number of clusters was unclear, we explored how our techniques could be used with different numbers of clusters. For brevity, we do not report silhouette scores, because they have limited interpretive value beyond identifying an optimal number of clusters, or the lack of one.
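A sketch of such a scan, again assuming scikit-learn; the range of candidate cluster counts is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_scan(X, k_values=range(2, 7), seed=0):
    """Print the mean silhouette score for each candidate number of clusters."""
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```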
Methods for analyzing the discovered clusters. Card sort tasks are often used to elicit information about how students organize concepts, and may also give insights into their schemas. Once a number of qualitatively interpretable clusters had been selected, we analyzed these clusters quantitatively and qualitatively in order to better understand the strategies that students use to group concepts, and to identify potential areas of difficulty or misunderstanding. Here we describe the methods we used to analyze the clusters generated by the procedures described above.
Most representative words. A word is considered representative if the rate of occurrence of that word within a cluster is larger than the rate of occurrence of that word among all responses. We can calculate a score for how representative a word is by using the following equation:
Wr = (Wc/Tc)/(Wo/To)
where Wc and Tc are the occurrence of a particular word and total number of words within a cluster, respectively, and Wo and To are the occurrence of a particular word and total number of words within all of the student responses, respectively. Words that have a high Wr value provide insight into commonalities of the language within each cluster and help us justify a qualitative title for the cluster. We only report the top 10 most representative words, but all words were considered for analysis.
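The computation of Wr can be sketched as follows; tokenization by whitespace splitting is a simplifying assumption:

```python
from collections import Counter

def representative_words(cluster_docs: list[str], all_docs: list[str],
                         top_n: int = 10):
    """Score each in-cluster word by Wr = (Wc/Tc) / (Wo/To)."""
    wc = Counter(w for d in cluster_docs for w in d.split())
    wo = Counter(w for d in all_docs for w in d.split())
    tc, to = sum(wc.values()), sum(wo.values())
    wr = {w: (n / tc) / (wo[w] / to) for w, n in wc.items()}
    return sorted(wr.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```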
Word cloud. Word clouds were generated and scaled according to the calculated score for the most representative words in each cluster. The more often a word was found among student justifications within a cluster compared to outside the cluster, the larger the word appeared within the cloud. The colors within the word cloud were used to distinguish one word from another, not to indicate any underlying meaning or relationship between words.
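A word cloud of this kind could be rendered with, for example, the wordcloud Python package; the styling parameters below are illustrative, not necessarily those used for the figures in this paper:

```python
from wordcloud import WordCloud

def cluster_wordcloud(wr_scores: dict[str, float], path: str) -> None:
    """Render a word cloud whose word sizes scale with the Wr scores."""
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(wr_scores)
    cloud.to_file(path)
```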
Most characteristic student response. In order to further characterize the clusters, we identified a student response that is most characteristic of the cluster as a whole. We did this by finding the student embedding nearest to the cluster centroid. The idea behind this is that we wanted to find the student response within a cluster that had the minimum average distance between itself and all the other student responses. This approach ensured we obtained the most informative response about the cluster's overall meaning. It is important to note that this response is the closest to the average within its cluster, not necessarily the furthest from other cluster centers. Furthermore, the most characteristic student response might not always align with the top 5 most common sorts, as it was determined by proximity to the cluster's centroid rather than frequency.
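A sketch of this selection, assuming the embedding matrix, k-means labels, and centroids produced by the clustering step above:

```python
import numpy as np

def most_characteristic(X: np.ndarray, labels: np.ndarray,
                        centroids: np.ndarray, cluster: int) -> int:
    """Index of the response whose vector lies nearest its cluster centroid."""
    members = np.where(labels == cluster)[0]
    dists = np.linalg.norm(X[members] - centroids[cluster], axis=1)
    return int(members[np.argmin(dists)])
```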
Edit distance. Edit distance (Deibel et al., 2005; Smith et al., 2013) is a distance metric commonly used for card sort tasks to determine how far one sort is from another. Edit distance is calculated as the minimum number of changes one would have to make to turn one sort into another. We primarily used edit distance to determine how different a sort is from one of our two canonical groupings. If a sort had only one misplaced card relative to a purely principle-based canonical sort, then that student would have a principle edit distance score (PED) of 1. If a student had made a perfectly principle-based canonical sort, that student would have a PED of 0, but would also have a representation edit distance score (RED) of 6, the number of changes it would take to turn a canonical principle sort into a canonical representation sort. The maximum edit distance a student can get for PED or RED is 6, and the minimum score is 0, which represents an identical sort.
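One common formulation of card-sort edit distance, following Deibel et al. (2005), finds a maximum-overlap matching between the groups of the two sorts; a sketch using SciPy's assignment solver is given below.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(sort_a: list[set[int]], sort_b: list[set[int]],
                  n_cards: int = 9) -> int:
    """Minimum number of cards to move to turn sort_a into sort_b, computed
    as n_cards minus the best total overlap over all matchings of groups."""
    overlap = np.array([[len(a & b) for b in sort_b] for a in sort_a])
    rows, cols = linear_sum_assignment(-overlap)  # maximize matched overlap
    return n_cards - int(overlap[rows, cols].sum())
```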
Canonical group occurrences. Another metric we used to characterize the sorts was the number of full canonical groups, groups that entirely match our pre-determined representation or principle-based sorts. We used this metric to describe how many principle-based canonical groups (PCG) or representation-based canonical groups (RCG) were in a sort. We then averaged the occurrence of these canonical groups (meanPCG, meanRCG) to characterize groups of responses.
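Counting full canonical groups reduces to checking each student group for an exact match; the card numbering in the example is an assumption tied to the Fig. 1 layout:

```python
# Assumption: cards 0-8 numbered row-major over the Fig. 1 grid.
PRINCIPLE_GROUPS = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]       # ST, MP, DL rows
REPRESENTATION_GROUPS = [{0, 3, 6}, {1, 4, 7}, {2, 5, 8}]  # SA, MC, SY columns

def canonical_group_count(sort: list[set[int]], canonical: list[set[int]]) -> int:
    """Number of student groups that exactly match a canonical group (PCG/RCG)."""
    return sum(group in canonical for group in sort)
```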
Sort block figures. Finally, after clustering with the k-means algorithm, we created a visualization called a sort block figure in order to visualize the most common sorts within each cluster. In this figure, the nine cards are ordered such that the rows represent the cards that belong to the principle-based canonical groups (ST, MP, DL) and the columns represent the cards that belong to the surface-level representation-based canonical groups (SA, MC, SY). Then, the cards are colored according to which group a student sorted each card into, such that all cards in a student-sorted group have the same color. Fig. 2 shows examples of sort block figures for two different sorts.
Fig. 2 Examples of sort block figures: representation-based canonical sort (left) and a noncanonical sort consisting of one principle-based group, two partial principle-based groups, and one partial representation-based group (right).
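A sketch of how such a figure could be drawn with matplotlib; the row-major numbering of the nine cards over the Fig. 1 grid is an assumption, as is the color map:

```python
import numpy as np
import matplotlib.pyplot as plt

def sort_block(sort: list[set[int]], ax: plt.Axes) -> None:
    """Draw a sort block: a 3x3 grid (rows = ST, MP, DL; columns = SA, MC, SY)
    where cells share a color when the student grouped the cards together.
    Assumes cards 0-8 are numbered row-major over the Fig. 1 grid."""
    grid = np.zeros((3, 3), dtype=int)
    for g, group in enumerate(sort):
        for card in group:
            grid[card // 3, card % 3] = g
    ax.imshow(grid, cmap="tab10", vmin=0, vmax=9)
    ax.set_xticks(range(3))
    ax.set_xticklabels(["SA", "MC", "SY"])
    ax.set_yticks(range(3))
    ax.set_yticklabels(["ST", "MP", "DL"])
```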

Results

We used the k-means clustering algorithm to address our first research question regarding how we can use machine learning to understand card sort data. Each cluster generated from this procedure potentially represents a theme within the dataset, helping us to identify emergent themes that would be difficult to detect through manual coding. By uncovering emergent themes, we can begin to address our second research question regarding how we can use machine learning to gather new understandings of students’ knowledge structures, which, as discussed above, may be related to their schemas. Because the acquisition of expertise involves the organization of factual information into categories for easy retrieval and application, an alternative perspective on the way that students categorize information could also shed light on how students acquire expertise. In this section we describe the clusters generated first in the language space and then in the sort space. Finally, we characterize intersections between the two.

Language clustering

The justifications were processed into numerical vectors, or embeddings. As described in Methods, this was done by passing each student justification through a pre-trained BERT language model and taking the mean of the word embeddings to produce a single sentence embedding. Using these BERT sentence embeddings, we were able to generate clusters of similar responses. Metrics like the silhouette score did not indicate how many clusters ought to be made for the language clusters, primarily because there was no clear trend as a function of the number of clusters. Instead, we analyzed the dataset a number of times with different numbers of clusters and chose the number we thought generated the most informative themes.

The Sankey diagram in Fig. 3 summarizes the different clusters that were uncovered as a greater number of clusters were chosen for the k-means clustering algorithm, with lines representing clusters and ribbons representing changes in group membership as different numbers of clusters were chosen. The width of each ribbon scales with the number of justifications it represents. We gave each cluster a label based on manual examination of the justifications, as quantitative metrics were insufficient to uncover the themes.


Fig. 3 Sankey diagram representing the distribution of justifications from two to six clusters in the language space.
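For readers who wish to reproduce this kind of diagram, a sketch using the plotly library is given below; it builds ribbons by counting students who move from one cluster to another as the number of clusters increases. The library choice is ours, not necessarily the one used to produce Fig. 3.

```python
import numpy as np
import plotly.graph_objects as go

def sankey_from_labels(labels_by_k: dict[int, np.ndarray]) -> go.Figure:
    """Sankey diagram of cluster membership as k increases.
    labels_by_k maps each k (e.g., 2..6) to a per-student label array."""
    ks = sorted(labels_by_k)
    names, sources, targets, values = [], [], [], []
    node = {}  # node index of (k, cluster)
    for k in ks:
        for c in range(k):
            node[(k, c)] = len(names)
            names.append(f"k={k}, cluster {c}")
    for k_prev, k_next in zip(ks, ks[1:]):
        a, b = labels_by_k[k_prev], labels_by_k[k_next]
        for ca in range(k_prev):
            for cb in range(k_next):
                n = int(np.sum((a == ca) & (b == cb)))
                if n:  # ribbon width = number of students moving ca -> cb
                    sources.append(node[(k_prev, ca)])
                    targets.append(node[(k_next, cb)])
                    values.append(n)
    return go.Figure(go.Sankey(node=dict(label=names),
                               link=dict(source=sources, target=targets,
                                         value=values)))
```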

Regardless of how many clusters were specified, there seemed to be a cluster that we could qualitatively characterize as principle-based. As more language clusters were specified, the “Principle” cluster's average PED approached zero, indicating that underlying-principle language is closely tied to the underlying-principle sort. In contrast, representation as a theme in the language data was no longer evident when three or more clusters were specified. Instead, various distinct themes perhaps tied to representation-based justifications became evident. For example, the “Representation” cluster is split into two from step two to three in the Sankey diagram. The first of these two new clusters was characterized as “Biology”, where students used unique language related to biological concepts. This theme remains from three clusters onward. The other new cluster in the transition from two clusters to three is characterized by us as “Difficulty”, which consists of justifications expressing some level of difficulty in addressing the card sort task, including words like “difficult”, “harder”, “hard”, and “help”.

When four clusters were specified, subsets of the “Difficulty” and “Principle” clusters combined to form a “Meta” cluster, where students justified their sorts by focusing on the mechanics of the task itself, instead of the underlying concepts. Students in the “Meta” cluster often described the constraints of the task or the act of placing the cards. There continued to be a justifiable “Meta” theme with up to six clusters. The remaining subset of the difficulty cluster is best characterized as a “Numerical vs. Conceptual” cluster, containing responses expressing a perceived dichotomy between problems requiring math and problems requiring understanding of one or more concepts.

When five clusters were specified, there remained qualitatively justifiable “Principle”, “Biology”, “Meta”, and “Numerical vs. Conceptual” clusters. In addition to these four clusters, we characterized the new cluster as a “Reasoning” cluster, which was created from a subset of the “Numerical vs. Conceptual” and “Principle” clusters. The “Reasoning” cluster is composed of responses revealing the methods of solving a problem and includes words like “applied”, “solved”, and “reasoning.” These justifications suggest process-oriented thinking in approaching the task.

When six clusters were specified, four clusters shared a characterization from the five clusters specification: “Principle”, “Biology”, “Meta”, and “Numerical vs. Conceptual.” In addition, we characterized two new clusters: “Complexity” and “Course Reflection.” The “Complexity” cluster is primarily a subset of the reasoning cluster that contains words like “complex”, “basics”, and “intro”, suggesting expressions of how complex or basic students perceive a chemistry problem to be. The “Course Reflection” cluster is primarily composed of a subset of the “Meta” cluster and a subset of the “Reasoning” cluster and contains words like “time”, “learning”, and “taught”, which partially suggest recollection of learned knowledge. The most representative student response in this cluster: “I chose the particular groups based on my previous knowledge of chemistry and how it has been taught to me in the past” further supports this characterization.

Finally, we found that a cluster that could be justifiably characterized as “Underlying Principle” could be found regardless of the number of language clusters specified. The students whose responses were categorized in this cluster also tended to sort according to the underlying principle canonical group. However, beyond the two-cluster specification there was not a single cluster that could be justifiably characterized as belonging to our representation-based canonical grouping. Instead, the representation cluster seemed to split into many different themes, perhaps suggesting the external representations of the cards were less meaningful to students than the other themes. Although there is no clear method for choosing a number of clusters, one could argue that all of the themes uncovered in the analysis of the different sets of clusters are potentially elements of students’ schemas when presented with chemistry problems represented in different ways. Further, the observation that principle-based themes seemed more stable than representation-based themes in the clustering analysis may lead to insights about the relative stability of these ways of organizing knowledge, and support claims that principle-based sorting is associated with more expertise. These insights relate to our first research question regarding how machine learning can be applied to help us better understand card sort data.

Four language clusters. For the sake of analysis, we chose the four-cluster partitioning, as this generated a diverse selection of themes while also ensuring a large number of responses per cluster. We describe each of these clusters below. For each cluster we describe the most representative words, most common sorts, most characteristic student responses, and various quantitative metrics, including mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG). We use these metrics in conjunction with one another for our characterization. The total number of students who made a given sort within the top 5 most common sorts in a cluster is listed above the sort block figure, but these sorts do not represent the total variety of sorts found within that given cluster.
Biology language cluster. The first cluster is named “Biology” based on the most representative words in the justifications of the sorts within this cluster (Fig. 4). These include “cells,” “cell,” “biology,” “biochemistry,” and “biological.” Recall that we did not stem the words, leading to variants of the same word appearing in this list (e.g., “cells” and “cell”); for transparency we opted not to merge examples like these as a post-processing step. Sorts in the biology cluster are closer to a representation sort (meanRED = 3.27) than a principle sort (meanPED = 4.23), but are, on average, still relatively far from both sorts. Relatedly, fewer than half the canonical groups are principle (meanPCG = 0.31) compared to representation (meanRCG = 0.84), but the average occurrence of both is low overall. For the sorts within this language cluster, a common feature is the creation of a single group containing the macroscopic-dilution card, which depicts a problem involving a mitochondrion.
Fig. 4 Biology language cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the biology language cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

The most characteristic student response seems to support our earlier analysis. This student also seemed to identify some underlying principles, as one group was titled “Percent Mass”, which appears to relate to the “Mass Percent” canonical grouping, and another group was titled “Molarity”, seemingly relating to the “Dilutions” canonical grouping. However, these student groups represent the underlying principle canonical groups in name only, as the cards within them do not resemble what we would expect from a student who organizes their knowledge around underlying principles. Still, this student's use of language seems to show that they were able to identify these underlying principles as meaningful group categories. In a similar way, this student created a group titled “reaction equations”, which relates to the “symbolic” surface-level representation canonical sort, but contains only the symbolic-stoichiometry card.

Numerical vs. conceptual language cluster. The second of the four clusters, shown in Fig. 5, seems to combine two potentially related ways of thinking about chemistry problems – numerically or conceptually. Students within this cluster sorted their cards closer to the representation canonical sort than students in the other clusters did (meanPED = 4.27; meanRED = 3.2), and there is on average one canonical representation group in each student sort (meanPCG = 0.34, meanRCG = 1.07). A common group in this cluster included all three symbolic representation cards combined with the two macroscopic representation cards that depicted numerical values in the problem description. The most representative words for responses in this cluster include “mathematical”, “calculations”, “calculation”, and “math”, suggesting the numerical representation of chemistry problems is a defining feature. However, words like “conceptual”, “reasoning”, and “thinking” were also common, suggesting students made a contrasting group that they saw as more focused on non-mathematical ideas or concepts. Another common feature of sorts in this cluster involved three small-particle representation cards being placed in the same group, perhaps suggesting the small particle model functions as an abstract concept for some students. Based on the combination of mathematical and conceptual language and groupings, this cluster was entitled “Numerical vs. Conceptual”.
Fig. 5 Numerical vs. conceptual language cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the numerical vs. conceptual language cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

The most characteristic student response illustrates this dichotomy. This student justified their sort by suggesting that some of the cards “fit together based on the structure of the problem (conceptual vs. analytical).” This student's second group was named “conceptual concepts”, and they sorted the cards within that group because they thought the cards did not require calculations. The third and fourth group justifications make reference to “formulas” and making calculations, as opposed to a “concept.” This student also showed some lack of certainty in the way they sorted their cards, as they indicate in their overall sort justification by suggesting that there may be some better way that these cards could be sorted. Additionally, this student indicates that they were unsure about how to sort the cards in their first group, titled “Combustion”. The language this student uses, along with their uncertain justifications, supports the idea that this is a cluster of students who see a “conceptual vs. mathematical” divide among these problems, but there is also some expression of the difficulty of the task.

Meta language cluster. The third language cluster (Fig. 6) again does not fit on a simple spectrum from principle to representation. This cluster seems to comprise responses that focused mostly on describing the act of performing the task. The most representative words in this cluster included “noticed”, “placed”, “named”, “card”, and “asked.” The students in this cluster did not sort based on underlying principles (meanPED = 3.56; meanPCG = 0.54) or on the visual representations (meanRED = 3.61; meanRCG = 0.70). The most characteristic student response for this cluster also supports the characterization of this cluster as primarily containing commentary on the task itself. This student justified their overall sort by describing their process of sorting in detail, without expanding on their thinking about why certain cards belonged together.
Fig. 6 Meta language cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the meta language cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.
Principle language cluster. The final language cluster (Fig. 7) fits most closely to the “underlying principle” canonical sort. Roughly two-thirds of students who made a principle-level sort were placed into this language cluster; of all the clusters, the sorts in this cluster have the smallest average edit distance from the purely principle sorting (meanPED = 3.22; meanRED = 4.06). Also for this cluster, there are few occurrences of either principle or representation canonical groups (meanPCG = 0.71; meanRCG = 0.5). The 3.22 average edit distance from the principle canonical sort is still fairly large for a cluster we are labeling “principle,” especially considering there are few principle canonical groups. However, the most representative words used by students in this cluster seem to map nicely onto the underlying principles built into the task. “Yield”, “reactants”, “excess”, “initial”, and “reagents” seem to point towards a “stoichiometry” group; “aqueous” and “molar” seem to point to a “dilutions” group; and the “%” symbol, along with the word “composition”, seems to point to a “mass percent” group. The discrepancy between what the language indicates about the thinking of students within this cluster and what the sort metrics show may suggest that there are students who express their thinking in a principle-based way but do not necessarily sort in this way.

The most characteristic justification for this cluster suggested the student was able to identify the underlying principles of the problems printed on the cards among the various surface-level representations. Of this student's four groups, three seem to map onto the three underlying principles built into the task. This student called their first group “stoichiometry”, and proceeded to describe the principle. The student continued to describe their second group as “ratios”, describing how this group included cards that have to do with molar masses and mole fractions, indicating a conceptual understanding of the “mass percent” underlying principle. This student's third group is called “molarity” and they describe how all the questions in that group had to do with “moles of solute in various solutions,” which maps to a dilutions canonical underlying principle group. While this student seemed to do well in identifying the underlying principles present in the task, they struggled to place the cards into canonical principle groups. Despite this student's justification being the most representative response of the principle language cluster, this student achieved a PED of 3 and a RED of 4, which is neither close to the underlying principle canonical sort nor to the surface-level representation canonical sort. In this case, looking at the sort scoring metrics alone would not have captured this student's sort as based in underlying principles.

While the most common sort within a cluster may provide valuable information to a researcher, its primary function is to quantify the dominant pattern in a data set. It does not necessarily reflect the unique characteristics and nuanced features of a specific cluster, and our analysis seeks to capture more than frequency. For instance, in the principle language cluster, a large plurality of students provided a canonical principle-based sort, but this uniformity can only offer insights into what is prevalent, not necessarily into what is distinctive.

The ‘most characteristic student response’ better captures the essence of a cluster by identifying a response that is most representative of the characteristics defined by the cluster's centroid, which might show subtleties and specificities that are potentially lost by analyzing the ‘most common sort.’ Irby et al. (2016) found that students who sorted based on principles exhibited principle-based reasoning in their sort justifications, which is consistent with what we find in this paper. However, we are interested in the characteristic response that embodies the distinct features of the principle language cluster, which may or may not align with the most common sort.

The analytical techniques described in this paper align with the affordances offered by machine learning, allowing researchers to analyze a large amount of data, identify underlying patterns, and glean insights that would be infeasible to uncover using traditional qualitative methods.

Clustering in the sort space

We can compare the extent to which students’ justifications match their actual sorts by defining, and creating clusters within, a “sort space”. The sort space was created by assigning each student a vector containing a series of distances between that student's sort and the sort of every other student. Students who made similar sorts were positioned near one another in this sort space, while students who made dissimilar sorts were positioned far from one another. We created these clusters using the k-means clustering algorithm. Each cluster can be interpreted as a theme found in the student card sort responses. A student's sort is indicative of the way they organize their knowledge. This helps us understand how the student approaches the material and what sort of knowledge they prioritize, and can potentially provide a window into their schemas. As in the previous section, the names chosen for these clusters were qualitatively determined by considering the most common sorts, the most representative words, and the various quantitative methods used to characterize groups of student responses, including edit distance and triplet occurrence (Fig. 7 and 8).
Fig. 7 Principle language cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the principle language cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

Fig. 8 Sankey diagram representing the distribution of justifications from two to six clusters in the sort space.

When five clusters were specified, the previous four groups carried forward, with the addition of a new cluster, which we characterized as a “Solubility” cluster. This cluster is composed primarily of two subsets of sorts belonging to the “Underlying Principle” and “Surface-Level Representation” clusters when four clusters were specified. These sorts tended to be closer to the underlying principle canonical sort, but they emphasized the submicroscopic surface-level representation. When six clusters were specified, five of the clusters could justifiably be characterized with the same titles as before, with the sixth characterized as a “Periodic Table” cluster. Responses in this cluster included language suggesting students based their sorts on information and trends found in the periodic table.

In this case, we were able to use the silhouette score to determine the ideal number of clusters: the score peaked when three clusters were specified. This is especially interesting considering the card sort task was originally designed with two methods of sorting in mind. Because our primary goal was to investigate the utility of machine learning for analyzing the language data, we do not describe the sort clusters in detail here; this information can be found in the appendix. The most notable findings from this exercise, for the purposes of our study, are that students’ language did not always match their sorts, and that different types of groups outside the principle and representation canonical sorts were defined in the two spaces. These findings are further explored below, where the language and sort spaces are compared.
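The model selection step itself is standard; a sketch of silhouette-based selection (Rousseeuw, 1987), written here as a generic scikit-learn recipe rather than our exact code, is:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 7), seed=0):
    """Sweep candidate numbers of clusters and return the k with the
    highest mean silhouette score, along with all scores."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

Applied to the sort space, such a sweep peaks at k = 3, the clustering described above.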

Language and sort cluster intersections

It is conceivable that students who sort alike do not think alike, and vice versa. Examining the intersection between the clusters of students who sort alike and the clusters of students who use similar language can shed light on students’ knowledge structures and potentially the schemas they use to approach problems.
Cluster link graph. To understand the relationship between the way a student sorts and the way they justify their sort, we represented the degree to which clusters overlap using a cluster link graph (Fig. 9), where the nodes on the left and right represent the four language clusters and the three sort clusters, respectively. One observation from this visualization is that the representation sort cluster splits relatively evenly across all four language clusters, suggesting that students who sort in a representation-based way may justify their thinking in many diverging ways, one of which is based on underlying principles. Notably, 198 students were grouped in the principle cluster in the language space but in the representation cluster in the sort space. In contrast, 47% of students whose sorts were clustered into the sort space principle cluster produced language corresponding to the language space principle cluster. Additionally, only a relatively small number of students belonging to the principle sort cluster produced language that clustered them into either the biology (11%) or the numerical vs. conceptual (12%) language clusters, which in the above analyses are associated with representation-based justifications. The remaining 30% of students who sorted according to principles were grouped in the meta cluster in the language space, which sits largely outside the representation–principle dichotomy. This suggests that students’ justifications are more likely to match their sorts if they sort by principle than if they sort by representation.
Fig. 9 Graph representing the intersections between clusters in the language space (left) and the sort space (right). Lines represent the fraction of students shared between the two clusters, with the wider and more yellow lines representing greater overlap.
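The computation behind such a graph is a simple contingency table between the two label sets. The sketch below assumes column normalization, so that each entry gives the fraction of a sort cluster falling into each language cluster (matching figures such as the 47% reported above); whether a visualization normalizes by row, column, or total is a presentational choice.

```python
import numpy as np

def cluster_link_matrix(lang_labels, sort_labels, n_lang=4, n_sort=3):
    """Contingency table between language clusters (rows) and sort clusters
    (columns), plus a column-normalized version giving the fraction of each
    sort cluster that falls into each language cluster."""
    counts = np.zeros((n_lang, n_sort))
    for lang, sort in zip(lang_labels, sort_labels):
        counts[lang, sort] += 1
    return counts, counts / counts.sum(axis=0, keepdims=True)
```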

About a third of the students within the difficulty sort cluster produced language that placed them into the numerical vs. conceptual language cluster. A key feature of the difficulty sort cluster was the prevalence of language expressing difficulty in completing the card sort task. There is relatively equal representation of the other three language clusters within the difficulty sort cluster, with the biology language cluster least represented. A more detailed analysis of a subset of these sort intersections can be found in the appendix.

By clustering in both the sort space and the language space, we find that there is no clear one-to-one relationship between the language students use and the sorts they make, especially for non-expert ways of organizing knowledge. However, expert-like sorts tend to be accompanied by similar kinds of language: there is greater crossover between the principle language cluster and the principle sort cluster than between any other pair of clusters. This suggests novice students who sort similarly may not organize their knowledge similarly; likewise, novice students with similar knowledge structures may sort in diverging ways.

Conclusion

Discussion

In this paper we describe methods involving machine learning to analyze results from a card sort task and its associated language data. These methods helped to uncover nuances in student thinking that are potentially associated with their representational schemas related to the chemistry triplet. One important finding is that students had many different ways of expressing their ideas about how the problems were related to each other, especially when their sorts took on more characteristics of surface-level sorts. They characterized problems by difficulty, by discipline, or by the process needed to solve the problem.

Notably, several students shared a particular sort but justified their sorts in different ways. For example, the canonical surface-level representation group was among the top five most common sorts of the biology language cluster, the numerical vs. conceptual language cluster, and the meta language cluster. This shows that the sorts themselves do not tell the whole story about students’ reasoning, and that analysis of students’ language is necessary to gain a fuller picture of how they organize knowledge. This is an illustration of the potential richness that machine learning can add to a card sort analysis, as the identification of more than one reasoning type for a single sort would have been very difficult through manual analysis. Machine learning is one way to supplement existing methods to further unpack justifications in cases like these.

Further, unpacking these results potentially adds finer detail to the conceptual framework on which the task is based. The card sort task was developed with two canonical sorts related to principle- and representation-level thinking, based upon the concept of the chemistry triplet (Johnstone, 1982). When clustering the language into four clusters, we found a distinct underlying principle cluster, but no explicit, unique surface-level representation language cluster. Instead, the three remaining clusters (meta, numerical vs. conceptual, and biology) seemed to act as different kinds of surface-level feature clusters. The fact that more surface-level feature clusters were created, compared to principle-based clusters, suggests that students used a greater diversity of distinct language when justifying surface-level feature sorts, which grouped cards within each triplet level, while students used more homogeneous language when justifying underlying principle sorts, in which the three levels of the chemistry triplet were combined. One potential implication of this finding is that there are several ways to organize knowledge within each of the three levels of the triplet, but fewer ways to connect them. In order for a chemistry novice to make a sort based on the chemistry triplet, the novice must have enough chemistry expertise to identify the chemistry triplet among the cards. Some students struggled to make any connections between the cards, and they ended up creating few groups with many cards or many groups with few cards. This seems to suggest that there is a lower “level” of sort below what we defined as the novice sort. Additionally, making connections between the levels of the triplet may show expertise, but it may also reflect random pairings arising from a complete lack of expertise.

Our work demonstrates the potential usefulness of machine learning methods for the analysis of card sort data. When paired with human expertise, judgment, and interpretation, machine learning can provide ways of gaining insights that may not be easily uncovered through other methods for analyzing card sort tasks. Traditional qualitative analysis alone often requires a high investment of resources, and still may not capture certain nuances in student thinking, especially regarding categories or organizational schemes that the card sort task was not originally designed to elicit. Here we demonstrate that machine learning techniques used in association with a card sort task can enhance the analysis of language data and potentially uncover novel ways of organizing and reasoning with the cards. Further, machine learning techniques may be used to develop and validate new card sort tasks, allowing researchers to create tasks where the language used by students and the sorts they give closely align. Such alignment may provide evidence that a student's sort closely approximates their knowledge structure and may give valid insight into their schema.

Limitations

A primary limitation of our method comes from our choice of the k-means clustering algorithm. Recall that k-means places centroids (the average vector of the points belonging to a cluster) within the clustering space and then assigns data points to clusters based on their proximity to the centroids. The initial placement of these centroids is random. This does not mean that the clusters themselves are random, and the clusters in this analysis tend to align closely across random states, but the random variation makes it difficult to make claims about cross-sectional or longitudinal changes between student groups. Future work may include new strategies that allow a researcher to make claims about how students transition between these clusters as they gain more domain-specific knowledge. For now, these methods may be best used as a means of characterizing a dataset as a whole. Future work may also involve different methods of clustering data.
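One natural check on this variability, offered here as an illustrative possibility rather than part of our analysis, is to re-run k-means under several random initializations and compare the resulting partitions with the adjusted Rand index, where a value of 1.0 indicates identical clusterings:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_stability(X, k, seeds=range(10)):
    """Compare k-means partitions across random initializations; values
    near 1.0 indicate the clusters align closely between random states."""
    runs = [KMeans(n_clusters=k, random_state=s, n_init=1).fit_predict(X) for s in seeds]
    return [adjusted_rand_score(runs[0], labels) for labels in runs[1:]]
```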

Within the sort space, we were able to determine an optimal number of clusters. An optimal clustering is one that minimizes the variance within clusters while maximizing the distinctness between clusters. However, there was no clear optimal number of clusters for the language space. Though the clusters within the language space are still qualitatively interpretable (i.e., there is a justifiable qualitative explanation for why those students ought to be placed in the same cluster), no quantitative metric justified a single optimal clustering. Future studies could explore various algorithms for finding optimal numbers of clusters.

For many of the clusters created using the k-means algorithm, a commonality could be found among the responses in that cluster. However, sometimes a cluster seemed to be characterized primarily by what was most unique about a sort or a written justification. For example, many students in the “Biology” language cluster tended to provide language relating to biology, but the remainder of their sort often reflected a diversity of different ways of thinking when qualitatively analyzed. This suggests machine learning techniques may provide insight into general themes that can summarize the collected data, but do not always uncover themes related to knowledge structures or schemas. This issue points to the necessity of human qualitative analysis to make judgments about the meaningfulness of data and suggests machine learning may represent a starting point for, or supplement to, qualitative data analysis.

Another potential limitation is that we based our study on the assumption that a student's written justification for their sort provides another valid window into their knowledge structure, one that can complement the information from their sort. However, in some cases students' sorts appeared not to align with their justifications. This calls into question the extent to which either the sorts themselves or the justifying language can be validly linked, through machine learning techniques, to students’ knowledge structures. There are limitations in both the card sort, which was designed with particular anchors in mind and chose three among many potential general chemistry topics for underlying principles, and in students’ written justifications, as students are not always facile in expressing their ideas in writing. Although we did conduct think-aloud interviews during pilot stages of this study, our dataset was not large enough to capture differences between verbal and written expression. The techniques outlined in this paper could potentially be adapted to an interview-based data collection strategy, which could extend this research by comparing clusters arrived at through analysis of verbal versus written justifications. In such a study, the ability to probe students’ thinking through follow-up prompts may make it possible to establish links between students’ sorts, language, and knowledge structures. This limitation further illustrates that machine learning should only be used as a supplement to, not a substitute for, human interpretation and analysis.

Finally, the structure of the card sort task itself limited what we were able to learn from a machine learning analysis of the sorts and justifications. This particular task was designed with clear novice and expert anchors in mind, which limited the number of interesting and meaningful sorts and justifications outside of those two anchors. An open-ended card sort task, such as that in Yamauchi (2005), coupled with machine learning, might generate a more authentic window into students’ knowledge structures, and potentially also their schemas.

Implications for instruction

Ultimately, machine learning methods for analyzing the results of card sort tasks are only useful if they add knowledge that can guide instructional practice. We learned several things from our machine learning analysis of the triplet-oriented card sort task that have implications for instruction. One such implication is that students first need to recognize the ideas presented in a problem in order to create useful schemas for solving it, and it seems that for many students, the representations depicted on the cards neither helped nor hindered their ability to do this. Accordingly, strategies for concept recognition should be explicitly taught and practiced, which necessitates the interleaving of problems from a range of different topics, something that is not always done in a general chemistry classroom. Once students are able to recognize some of the underlying principles of the problems, it appears they are better able to organize their knowledge and justify it in ways consistent with these principles.

Our finding that students’ justifications and sorts do not always overlap suggests that when formatively assessing students’ expertise, especially when it comes to connecting or categorizing multiple ideas, more than one data point is necessary. Sources of evidence should include opportunities for students to express their thinking in an open-ended manner. When students’ explanations do not seem consistent with their performance on other tasks, it is an opportunity to inquire further into their thinking (as well as the assessment methods), rather than assuming a position along the continuum from novice to expert. It may be that students have a truly novice level of expertise and require further instruction, but it may also be that they have difficulty expressing their ideas or knowledge structures, or even that they have a different way of thinking about or organizing concepts that could be fruitful to build upon.

Finally, our finding that many students were unable to recognize the underlying principles and instead sorted based on type of representation, perceived difficulty, or even discipline reinforces the need to explicitly teach students about the different levels of the chemistry triplet. This should go beyond including multiple representations in problems: instruction should explicitly connect those representations in ways that show their roles in scientific modeling. In this way, helping students develop their ability to make connections between these levels engages them in using important scientific practices.

Conflicts of interest

There are no conflicts of interest to report.

Appendix

Analysis of three sort clusters

As before with the language clusters, we can characterize the sort clusters using the various techniques we outlined in the methods section. These include the most representative words, the most characteristic student response, the most common sorts, and various quantitative metrics.
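For reference, the edit distance and canonical group metrics can be sketched as follows, reusing the sort_edit_distance function given in the sort space section. The card-to-group assignments below are placeholders only; the actual mapping of the nine cards to the canonical groups is defined by the task (Irby et al., 2016), not by this sketch.

```python
# Placeholder canonical sorts over hypothetical card labels 'C1'..'C9'.
CANONICAL_PRINCIPLE = [{'C1', 'C4', 'C7'}, {'C2', 'C5', 'C8'}, {'C3', 'C6', 'C9'}]        # ST, MP, DL
CANONICAL_REPRESENTATION = [{'C1', 'C2', 'C3'}, {'C4', 'C5', 'C6'}, {'C7', 'C8', 'C9'}]   # macro, micro, symbolic

def ped(sort):
    """Principle edit distance: moves needed to reach the canonical principle sort."""
    return sort_edit_distance(sort, CANONICAL_PRINCIPLE)

def red(sort):
    """Representation edit distance: moves needed to reach the canonical representation sort."""
    return sort_edit_distance(sort, CANONICAL_REPRESENTATION)

def canonical_groups(sort, canonical):
    """Count the groups in a student's sort that exactly reproduce a canonical
    group; cluster means of these counts give the PCG and RCG metrics."""
    return sum(1 for group in sort if any(set(group) == c for c in canonical))
```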
Representation sort cluster. The first cluster (Fig. 10) contains sorts resembling the canonical representation sort, including all 44 students who reproduced that sort exactly. This suggests that this cluster can reasonably be characterized as a “Surface-Level Representation” cluster. The average representation edit distance (RED) was about 2.5, whereas the average principle edit distance (PED) was 4.86, indicating that the student sorts within this cluster are closer to the canonical representation sort than to the canonical principle sort. Additionally, there was an average of 1.17 representation canonical groups but only 0.01 principle canonical groups, again suggesting that the student sorts within this cluster are closer to the representation sort than the principle sort.
Fig. 10 Representation sort cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the representation sort cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

The representative language of this “Representation” cluster also supports this characterization. Students used words like “diagrams”, “images”, and “visualizing”, potentially referring to the actual representations on the cards and how they are visualizations of chemistry principles. Students also used words and phrases like “real life” and “real world”, potentially referring to how many cards within the representation sorting seem to relate to observable phenomena on the macroscopic scale. Relatedly, students used words like “applications”, “situations”, and “apply”, potentially referring to how the representations on the cards apply to underlying principles of chemistry.

The most characteristic student response for this cluster also supports its characterization as a “surface-level representation” cluster. This student had a RED score of 2 and a PED score of 6, showing that they sorted closer to the surface-level representation canonical sort than to the underlying principle canonical sort. This student justified their sort by stating that it was “based on the visual representations” and that they “organised [their cards] to what elements [they] saw in the diagrams/pictures and the situation or frame of the questions.” This student made four groups, three of which map neatly onto the three canonical surface-level representation groupings based on the language justifications. The first group was self-titled “formula and equations problems” and corresponds to the “Symbolic” canonical grouping; the second was self-titled “Physical application”, which seems to align with the “Macroscopic” canonical grouping; and the third was self-titled “Atomic structure and properties”, which matches the “Submicroscopic” canonical group. The student's final group was titled “biological” and contained one card involving mitochondria in the cell. That the most representative student of this cluster created groups mapping closely to the canonical surface-level representation groups suggests that characterizing this cluster as a “surface-level representation” cluster is reasonable.

Principle sort cluster. The second of the three clusters (Fig. 11) contained all 61 students who made a canonical principle sort, along with other sorts resembling a principle sort. This suggests that a reasonable characterization of this cluster would be as an “Underlying Principle” cluster. The average PED is 2.37, while the average RED is 4.87, indicating that the student sorts within this cluster are closer to the canonical principle sort than the canonical representation sort. Additionally, there was an average of 1.04 principle canonical groups but an average of 0.01 representation canonical groups, again suggesting that this cluster is composed primarily of student sorts closer to the principle canonical sort.
Fig. 11 Principle sort cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the principle sort cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

The language that students used within this cluster supports our categorization of it as an “Underlying Principle” cluster. Students in this cluster used words like “solubility”, “aqueous”, “dilutions”, “dilution”, and “reagent”, which all seem to refer to the “Dilution” canonical grouping (DL). Students also used words like “percentage”, “percent”, and “composition”, indicating that they were describing the mass percent (MP) canonical grouping. Lastly, students used words like “excess”, “limiting”, “reactant”, and “initial”, which seem to describe the stoichiometry (ST) canonical grouping. All of the most representative words for this cluster can thus be used to qualitatively infer its categorization.

The most characteristic student response for this cluster also supports its characterization as an “underlying principle” cluster. This student had a RED score of 6 and a PED score of 2, showing that they sorted closer to the underlying principle canonical sort than to the surface-level representation canonical sort. This student based their sort on “the math that had to be done” to solve the problems. Four groups were created by this student, three of which correspond closely to how we would expect a student to justify the three canonical underlying principle groups. The first group was titled “Ratios” and contained two cards that reportedly required ratios of particles; it does not correspond to a canonical group. The second group was titled “Stoichiometry Reactions”, which relates directly to the “Stoichiometry” underlying principle canonical group; the third was titled “Percent Mass”, which relates directly to the “Mass Percent” underlying principle canonical group; and the fourth was titled “Molarity Concentrations”, which seems to map to the “Dilution” underlying principle canonical group. Because the most characteristic student response within this cluster created groups corresponding to what we would expect in an “Underlying Principle” canonical sort, this cluster can be characterized as an “Underlying Principle” cluster.

Difficulty sort cluster. The third cluster (Fig. 12) contained sorts that are more similar to the “Surface-Level Representation” canonical sort than to the “Underlying Principle” canonical sort, but it has characteristics that distinguish it from both. For this cluster, the mean RED is 3.66 and the mean PED is 4.16, and the mean number of representation canonical groups is 1.16 while the mean number of principle canonical groups is 0.63. Based on these metrics, this cluster does not contain student responses that are close to either canonical sort. Instead, it seems to represent a third way of thinking, different from the two canonical groupings. A key feature of the sorts within this cluster is that they often contain only a few groups, with many cards in each group.
Fig. 12 Difficulty sort cluster. Word cloud with most representative words, 5 most common sorts, and most characteristic student natural language justification for the difficulty sort cluster. Mean principle edit distance (meanPED), mean representation edit distance (meanRED), mean principle canonical groups (meanPCG), and mean representation canonical groups (meanRCG) reported for this cluster.

It is difficult to characterize this cluster within the principle and representation binary, as the average sort within this cluster is far from both the canonical underlying principle sort and the canonical surface-level representation sort. From the language, we can see that some of the most representative words for this cluster were “believe,” “remember,” “taught,” “complicated,” “presented,” “learn” and “subjects”, possibly indicating that these students were struggling to recall information from class and to identify the specific underlying principles or concepts needed to make an informed sort. Based on these representative words, we categorized this cluster as a “Difficulty” cluster, as students found it difficult to recall and identify the information that would help them perform the task.

The justification for the most characteristic sort in this cluster suggested the student was able to identify some underlying principles but perceived the rest of the cards as both unique and unrelated. This student reported that they made one group based on the kind of calculation each question was asking for, while the rest of the cards were deemed “outliers that appeared to be different concepts.” This student made four groups: one with five of the nine cards, and three more containing the remaining four cards. The first group was titled “calculating moles” because, according to this student, “you are accounting for concentration and calculating number of moles/limiting reactants.” This justification seems to suggest that the student identified two underlying principles, dilutions and stoichiometry, but did not see those concepts as distinct enough to separate into their own groups. Instead, this student elected to make one large group containing a majority of the cards. The remaining cards were separated into their own small groups, two of which had only one card each, suggesting that this student struggled to identify information uniting those cards. This student's sort captures the difficulty a student may have in identifying the information necessary to find the similarities and differences among chemistry problems.

Intersection characteristic sort

Much as we found representative student responses for each individual cluster, we can find the most representative response for each intersection between clusters. Looking at the most characteristic student sorts for some of the intersections between the sort clusters and the language clusters can provide valuable insight into the relationship between student sorts and their language.
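A minimal sketch of this procedure: take the set of students belonging to both clusters, compute the centroid of that intersection, and return the student nearest to it. For illustration we assume the centroid is computed in the language space; computing it in the sort space, or in both, would be an equally plausible choice.

```python
import numpy as np

def intersection_characteristic(lang_labels, sort_labels, lang_embeddings,
                                student_ids, lang_cluster, sort_cluster):
    """Among students in both the given language cluster and the given sort
    cluster, return the student whose language embedding lies closest to the
    centroid of the intersection (label arrays are assumed to be NumPy arrays)."""
    members = np.where((lang_labels == lang_cluster) & (sort_labels == sort_cluster))[0]
    centroid = lang_embeddings[members].mean(axis=0)
    distances = np.linalg.norm(lang_embeddings[members] - centroid, axis=1)
    return student_ids[members[np.argmin(distances)]]
```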
Principle sort and principle language cluster intersection. For example, to examine the relationship between students who sorted in a principle-level way and explained their sort in a principle-level way, we characterize the students occupying the intersection between the “Underlying Principle” sort cluster and the “Underlying Principle” language cluster. The most characteristic student response in this intersection (Fig. 13) is the kind of response we would expect from a student with chemistry expertise. This student made four groups, two of which had the same title, suggesting they reformulated their thinking toward underlying principles mid-response. This student named their groups “Stoichiometry”, “Molecules”, and “Concentrations”, corresponding to the “Stoichiometry”, “Mass Percent”, and “Dilutions” underlying principle groups, respectively. The mean PED for the students in this principle–principle intersection is 2.19, indicating that these students create particularly principle-level sorts; this is closer to the canonical underlying principle sort than the mean of either the “Underlying Principle” sort cluster (2.37) or the “Underlying Principle” language cluster (3.22) alone. Among students who create an underlying principle-like sort, those who explain their thinking using principle-level language sort even closer to the canonical underlying principle sort than those who do not. This provides evidence for the idea that a student's knowledge organization around underlying principles is related to their use of underlying principle language.
Fig. 13 Sort associated with the most characteristic student response for the intersection between the principle sort cluster and the principle language cluster.
Difficulty sort and principle language cluster intersection. The most characteristic student response for students clustered into both the “Difficulty” sort cluster and the “Principle” language cluster (Fig. 14) demonstrates an interesting intersection of thinking that can be captured by evaluating a response based on both sorts and language. This student created three groups: “Concentration”, “Stoichiometry”, and “Calculating the mass of a molecule”. These groups seem to match the three underlying principles present in the chemistry problems on the cards. However, this student expressed uncertainty in their ability to do the task: “It was very hard to decide what stoichiometry problems to split up, and I am still not sure I am happy with how I did split them.” This student's “Stoichiometry” group contains five of the nine cards, three of which are stoichiometry problems; the other two are mass percent and dilution problems. This student seems to have struggled to discriminate between the kinds of problems printed on each card, yet used language suggesting they were thinking in a principle-level way. This shows how clustering based on both language and sorts can assist in identifying central themes in the way students sort their cards, themes which often intersect in unique ways, providing a fuller picture of how students think and organize their knowledge.
Fig. 14 Sort associated with the most characteristic student response for the intersection between the difficulty sort cluster and the principle language cluster.
Representation sort and principle language cluster intersection. The most characteristic student response for students clustered in the “Representation” sort cluster and the “Principle” language cluster (Fig. 15) shows how this technique can identify students approaching an underlying principle way of organizing their knowledge, which is not necessarily captured in their sort alone. This student created four groups: “Combustion”, “Stoichiometry”, “Physics”, and “Solutions”, three of which correspond to the three underlying principle canonical groups. The “Combustion” and “Solutions” groups seem to correspond to the “Stoichiometry” and “Dilutions” underlying principle canonical groups, and the “Physics” group was justified as involving “ratios of masses”, which corresponds to the “Mass Percent” underlying principle canonical group. However, this student created a canonical “Symbolic” group and titled it “Stoichiometry”, suggesting a potential conflation between stoichiometry as a principle in chemistry and symbolic representations of chemistry problems. Nevertheless, this response illustrates how the student was able to identify underlying principles and explore their understanding of them in their free-form language justification. This example seems to show a student approaching expert-level thinking, but their sort alone would not necessarily have captured the nuances of this response.
Fig. 15 Sort associated with the most characteristic student response for the intersection between the representation sort cluster and the principle language cluster.
Difficulty sort and numerical vs. conceptual language cluster intersection. The most characteristic student response for students belonging to both the “Difficulty” sort cluster and the “Numerical vs. Conceptual” language cluster (Fig. 16) shows a sharing of features unique to each cluster. For example, a common feature of the “Difficulty” sort cluster was that students tended to create few groups and place many cards into one of them. This student made a sort corresponding to what we would expect from a canonical “Surface-Level Representation” sort, but merged the macroscopic and symbolic canonical groups into one. A common feature of the “Numerical vs. Conceptual” language cluster was that students identified a dichotomous relationship between cards relating to mathematics and cards relating to non-mathematical concepts. This student also justified their sort in this way, naming their largest group “Calculation” and their smaller group “Concept.” This is an example of how the intersection between clusters in different spaces can surface meaningful student responses sharing qualities from both clusters.
Fig. 16 Sort associated with the most characteristic student response for the intersection between the difficulty sort cluster and the numerical vs. conceptual language cluster.

Acknowledgements

We would like to thank the students and instructors who were gracious enough to assist with data collection. We would also like to thank Spenser Stumpf and Matt Smiley for their contributions to this work. Finally, we thank the Washington NASA Space Grant Consortium for contributing to the funding of this project.

References

  1. Acton W. H., Johnson P. J. and Goldsmith T. E., (1994), Structural Knowledge Assessment: Comparison of Referent Structures, J. Educ. Psychol., 86(2), 303–311.
  2. Bransford J. D., Brown A. L. and Cocking R. R., (2000), How People Learn: Brain, Mind, Experience, and School, Washington, DC: National Academy Press.
  3. Bussolon S., (2009), Card sorting, category validity, and contextual navigation, J. Inf. Archit., 5–29.
  4. Carruthers S. P., Gurvich C. T., Meyer D., Bousman C., Everall I. P., Neill E., Pantelis C., Sumner P. J., Tan E. J. and Thomas E. H., (2019), Exploring heterogeneity on the Wisconsin card sorting test in schizophrenia spectrum disorders: a cluster analytical investigation, J. Int. Neuropsychol. Soc., 25(7), 750–760.
  5. Chen Z., (1999), Schema induction in children's analogical problem solving, J. Educ. Psychol., 91(4), 703–715.
  6. Chi M. T. H., Feltovich P. J. and Glaser R., (1981), Categorization and representation of physics problems by experts and novices, Cognitive Sci., 5(2), 121–152.
  7. Davidson A., Addison C. and Charbonneau J., (2022), Examining Course-Level Conceptual Connections Using a Card Sort Task: A Case Study in a First-Year, Interdisciplinary, Earth Science Laboratory Course, Teach. Learn. Inquiry, 10, 1–20.
  8. Deibel K., Anderson R. and Anderson R., (2005), Using edit distance to analyze card sorts, Expert Systems, 22(3), 129–138.
  9. Devlin J., Chang M.-W., Lee K. and Toutanova K., (2018), Bert: pre-training of deep bidirectional transformers for language understanding, arXiv, preprint, arXiv:1810.04805.
  10. Domin D. S., Al-Masum M. and Mensah J., (2008), Students’ categorizations of organic compounds, Chem. Educ. Res. Pract., 9(2), 114–121.
  11. Ewing G., Logie R., Hunter J., McIntosh N., Rudkin S. and Freer Y., (2002), A new measure summarising ‘Information’ conveyed in cluster analysis of card-sort data, IDAMAP 2002, 4, 25.
  12. Eysenck M. W. and Keane M. T., (2005), Cognitive Psychology: A Student's Handbook, 5th edn, New York: Psychology Press.
  13. Fincher S. and Tenenberg J., (2005), Making sense of card sorting data, Expert Systems, 22(3), 89–93.
  14. Fossum T. and Haller S., (2005), Measuring card sort orthogonality, Expert Systems, 22(3), 139–146.
  15. Galloway K. R., Leung M. W. and Flynn A. B., (2018), A Comparison of How Undergraduates, Graduate Students, and Professors Organize Organic Chemistry Reactions, J. Chem. Educ., 95(3), 355–365.
  16. Galotti K. M., (2014), Cognitive Psychology In and Out of the Laboratory, 5th edn, Los Angeles: SAGE Publications.
  17. Gentner D., (2005), The development of relational category knowledge, in Gershkoff-Stowe L. and Rakison D. H. (ed.) Building object categories in developmental time, Hillsdale, NJ: Erlbaum.
  18. Gentner D. and Medina J., (1998), Similarity and the development of rules, Cognition., 65(2–3), 263–297.
  19. Gläscher J., Adolphs R. and Tranel D., (2019), Model-based lesion mapping of cognitive control using the Wisconsin Card Sorting Test, Nat. Commun., 10(1), 20.
  20. Graulich N. and Bhattacharyya G., (2017), Investigating students’ similarity judgments in organic chemistry, Chem. Educ. Res. Pract., 18(4), 774–784.
  21. Hastie T., Tibshirani R. and Jerome F., (2009), The Elements of Statistical Learning Data Mining, Inference, and Prediction, Second edn, Springer.
  22. Irby S., Phu A., Borda E. J., Haskell T., Steed N. and Meyer Z., (2016), Use of a card sort task to assess students’ ability to coordinate three levels of representation in chemistry, Chem. Educ. Res. Pract., 17(2), 337–352.
  23. Jaber L. Z. and BouJaoude S., (2012), A Macro-Micro-Symbolic Teaching to Promote Relational Understanding of Chemical Reactions, Int. J. Sci. Educ., 34(7), 973–998.
  24. Johnstone A. H., (1982), Macro- and micro-chemistry, School Sci. Rev., 64, 377–379.
  25. Kern A. L., Wood N. B., Roehrig G. H. and Nyachwaya J., (2010), A qualitative report of the ways high school chemistry students attempt to represent a chemical reaction at the atomic/molecular level, Chem. Educ. Res. Pract., 11(3), 165–172.
  26. Kozma R. B. and Russell J., (1997), Multimedia and understanding: expert and novice responses to different representations of chemical phenomena, J. Res. Sci. Teach., 34(9), 949–968.
  27. Krieter F. E., Julius R. W., Tanner K. D., Bush S. D. and Scott G. E., (2016), Thinking Like a Chemist: Development of a Chemistry Card-Sorting Task To Probe Conceptual Expertise, J. Chem. Educ., 93(5), 811–820.
  28. Lajoie S. P., (2003), Transitions and Trajectories for Studies of Expertise, Educ. Res., 32(8), 21–25.
  29. Lapierre K. R., Streja N. and Flynn A. B., (2022), Investigating the role of multiple categorization tasks in a curriculum designed around mechanistic patterns and principles, Chem. Educ. Res. Pract., 23(3), 545–559.
  30. Lin S.-Y. and Singh C., (2010), Categorization of Quantum Mechanics Problems by Professors and Students.
  31. Macías J. A., (2021), Enhancing Card Sorting Dendrograms through the Holistic Analysis of Distance Methods and Linkage Criteria, J. Usability Stud., 16(2), 73–90.
  32. MacQueen J., (1967), Some methods for classification and analysis of multivariate observations, in the proceedings of the Berkeley symposium on mathematical statistics and probability, in Fifth Berkeley Symposium on Mathematical Statistics and Probability, Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, Berkeley, USA: University of California Press, pp. 281–297.
  33. Martine G. and Rugg G., (2005), That site looks 88.46% familiar: quantifying similarity of Web page design, Expert Systems, 22(3), 115–120.
  34. Mason A. and Singh C., (2011), Assessing expertise in introductory physics using categorization task, Phys. Rev. ST Phys. Educ. Res., 7(2), 020110.
  35. Mayer R. E., (2012), Information Processing, in APA Educational Psychology Handbook, WA: American Psychological Association, pp. 85–99.
  36. McCauley R., Murphy L., Westbrook S., Haller S., Zander C., Fossum T., Sanders K., Morrison B., Richards B. and Anderson R., (2005), What do successful computer science students know? An integrative analysis using card sort measures and content analysis to evaluate graduating students’ knowledge of programming concepts, Expert Systems, 22(3), 147–159.
  37. Mikolov T., Chen K., Corrado G. and Dean J., (2013), Efficient estimation of word representations in vector space, arXiv, preprint, arXiv:1301.3781.
  38. National Research Council, (2012), A framework for K-12 science education: Practices, crosscutting concepts, and core ideas, Washington, DC: The National Academies Press.
  39. Nehm R. H. and Ridgway J., (2011), What do experts and novices “see” in evolutionary problems? Evolution: Educ. Outreach, 4(4), 666–679.
  40. Paea S., Katsanos C. and Bulivou G., (2021), Information architecture: Using K-Means clustering and the Best Merge Method for open card sorting data analysis, Interact. Comput., 33(6), 670–689.
  41. Paul C. L., (2014), Analyzing Card-Sorting Data Using Graph Visualization, J. Usability Stud., 9(3), 87–104.
  42. Revlin R., (2012), Cognition: Theory and Practice, 1st edn, New York: Worth Publishers.
  43. Rousseeuw P. J., (1987), Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math., 20, 53–65.
  44. Sanders K., Fincher S., Bouvier D., Lewandowski G., Morrison B., Murphy L., Petre M., Richards B., Tenenberg J., Thomas L., Anderson R., Anderson R., Fitzgerald S., Gutschow A., Haller S., Lister R., McCauley R., McTaggart J., Prasad C., Scott T., Shinners-Kennedy D., Westbrook S. and Zander C., (2005), A multi-institutional, multinational study of programming concepts using card sort data, Expert Systems, 22(3), 121–128.
  45. Shinde P., Szwillus G. and Keil I. R., (2017), Application of Existing k-means Algorithms for the Evaluation of Card Sorting Experiments, PhD thesis, Paderborn University Paderborn, Germany.
  46. Singer S. R., Nielsen N. R. and Schweingruber H. A., (2012), Discipline-Based Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering, Washington, DC: National Academies Press.
  47. Smith M. U., (1992), Expertise and the organization of knowledge: unexpected differences among genetic counselors, faculty, and students on problem categorization tasks, J. Res. Sci. Teach., 29(2), 179–205.
  48. Smith J. I., Combs E. D., Nagami P. H., Alto V. M., Goh H. G., Gourdet M. A. A., Hough C. M., Nickell A. E., Peer A. G., Coley J. D. and Tanner K. D., (2013), Development of the Biology Card Sorting Task to Measure Conceptual Expertise in Biology, Cbe-Life Sci. Educ., 12(4), 628–644.
  49. Stains M. and Talanquer V., (2008), Classification of chemical reactions: stages of expertise, J. Res. Sci. Teach., 45(7), 771–793.
  50. Stiles-Shields C., Montague E., Lattie E. G., Kwasny M. J. and Mohr D. C., (2017), What might get in the way: Barriers to the use of apps for depression, Digital Health, 3, 2055207617713827.
  51. Taber K. S., (2013), Revisiting the chemistry triplet: drawing upon the nature of chemical knowledge and the psychology of learning to inform chemistry education, Chem. Educ. Res. Pract., 14(2), 156–168.
  52. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L. and Polosukhin I., (2017), Attention is all you need, arXiv, preprint, arXiv:1706.03762.
  53. Yamauchi T., (2005), Labeling Bias and Categorical Induction: Generative Aspects of Category Information, J. Exp. Psychol. Learn., Memory, Cognition, 31(3), 538–553.
