Stephen M. Danczak,* Christopher D. Thompson and Tina L. Overton
School of Chemistry, Monash University, Victoria 3800, Australia. E-mail: stephen.danczak@monash.edu
First published on 12th July 2019
The importance of developing and assessing student critical thinking at university can be seen through its inclusion as a graduate attribute for universities and from research highlighting the value employers, educators and students place on demonstrating critical thinking skills. Critical thinking skills are seldom explicitly assessed at universities. Commercial critical thinking assessments, which are often generic in context, are available. However, literature suggests that assessments that use a context relevant to the students more accurately reflect their critical thinking skills. This paper describes the development and evaluation of a chemistry critical thinking test (the Danczak–Overton–Thompson Chemistry Critical Thinking Test or DOT test), set in a chemistry context, and designed to be administered to undergraduate chemistry students at any level of study. Development and evaluation occurred over three versions of the DOT test through a variety of quantitative and qualitative reliability and validity testing phases. The studies suggest that the final version of the DOT test has good internal reliability, strong test–retest reliability, moderate convergent validity relative to a commercially available test and is independent of previous academic achievement and university of study. Criterion validity testing revealed that third year students performed statistically significantly better on the DOT test relative to first year students, and postgraduates and academics performed statistically significantly better than third year students. The statistical and qualitative analysis indicates that the DOT test is a suitable instrument for the chemistry education community to use to measure the development of undergraduate chemistry students’ critical thinking skills.
A survey of 167 recent science graduates compared the development of a variety of skills at university to the skills used in the work place (Sarkar et al., 2016). It found that 30% of graduates in full-time positions identified critical thinking as one of the top five skills they would like to have developed further within their undergraduate studies. Students, governments and employers all recognise that not only is developing students’ critical thinking an intrinsic good, but that it better prepares them to meet and exceed employer expectations when making decisions, solving problems and reflecting on their own performance (Lindsay, 2015). Hence, it has become somewhat of an expectation from governments, employers and students that it is the responsibility of higher education providers to develop students’ critical thinking skills. Yet, despite the clear need to develop these skills, measuring student attainment of critical thinking is challenging.
Cognitive psychologists and education researchers use the term critical thinking to describe a set of cognitive skills, strategies or behaviours that increase the likelihood of a desired outcome (Halpern, 1996b; Tiruneh et al., 2014). Psychologists typically investigate critical thinking experimentally and have developed a series of reasoning schemas with which to study and define critical thinking: conditional reasoning, statistical reasoning, methodological reasoning and verbal reasoning (Nisbett et al., 1987; Lehman and Nisbett, 1990). Halpern (1993) expanded on these schemas to define critical thinking as the thinking required to solve problems, formulate inferences, calculate likelihoods and make decisions.
In education research there is often an emphasis on critical thinking as a skill set (Bailin, 2002) or putting critical thought into tangible action (Barnett, 1997). Dressel and Mayhew (1954) suggested it is educationally useful to define critical thinking as the sum of specific behaviours which could be observed from student acts. They identify these critical thinking abilities as identifying central issues, recognising underlying assumptions, evaluating evidence or authority, and drawing warranted conclusions. Bailin (2002) raises the point that from a pedagogical perspective many of the skills or dispositions commonly used to define critical thinking are difficult to observe and, therefore, difficult to assess. Consequently, Bailin suggests that the concept of critical thinking should explicitly focus on adherence to criteria and standards to reflect ‘good’ critical thinking (Bailin, 2002, p. 368).
It appears that there are several definitions of critical thinking, each of comparable value (Moore, 2013). There is agreement across much of the field that meta-cognitive skills, such as self-evaluation, are essential to a well-rounded process of critical thinking (Glaser, 1984; Kuhn, 1999; Pithers and Soden, 2000). Key themes such as ‘critical thinking: as judgement, as scepticism, as originality, as sensitive reading, or as rationality’ can be identified across the literature. In the context of developing an individual's critical thinking it is important that these themes take the form of observable behaviours.
In the latter half of the 20th century informal logic gained academic credence as it challenged the earlier view that logic was purely a matter of deduction or inference, proposing instead theories of argumentation and logical fallacies (Johnson et al., 1996). These theories began to be taught at universities as standalone, context-free courses that aimed to teach the structure of arguments and the recognition of fallacies using abstract theories and symbolism. Cognitive psychology research provided evidence that critical thinking could be developed within a specific discipline and that those reasoning skills were, at least to some degree, transferable to situations encountered in daily life (Lehman et al., 1988; Lehman and Nisbett, 1990). These perspectives form the basis of the generalist position, which holds that critical thinking can be developed independently of subject-specific knowledge.
McMillan (1987) carried out a review of 27 empirical studies conducted at higher education institutions where critical thinking was taught, either in standalone courses or integrated into discipline-specific courses such as science. The review found that standalone and integrated courses were equally successful in developing critical thinking, provided critical thinking developmental goals were made explicit to the students. The review also suggested that the development of critical thinking was most effective when its principles were taught across a variety of discipline areas so as to make knowledge retrieval easier.
Ennis (1989) suggested that there are a range of approaches through which critical thinking can be taught: general, where critical thinking is taught separate from content or ‘discipline’; infusion, where the subject matter is covered in great depth and teaching of critical thinking is explicit; immersion, where the subject matter is covered in great depth but critical thinking goals are implicit; and mixed, a combination of the general approach with either the infusion or immersion approach. Ennis (1990) arrived at a pragmatic view to concede that the best critical thinking occurs within one's area of expertise, or domain specificity, but that critical thinking can still be effectively developed with or without discipline specific knowledge (McMillan, 1987; Ennis, 1990).
Many scholars remain entrenched in the debate regarding the role discipline-specific knowledge plays in the development of critical thinking. For example, Moore (2011) rejected the use of critical thinking as a catch-all term to describe a range of cognitive skills, believing that teaching critical thinking as a set of generalisable skills is insufficient to provide students with an adequate foundation for the breadth of problems they will encounter throughout their studies. Conversely, Davies (2013) accepts that critical thinking skills share fundamentals common to all disciplines and that there can be a need to accommodate discipline-specific needs ‘higher up’ in tertiary education via the infusion approach. However, Davies considers the specifist approach to developing critical thinking ‘dangerous and wrong-headed’ (Davies, 2013, p. 543), citing government reports and primary literature which demonstrate tertiary students’ inability to identify elements of arguments, and championing the need for standalone critical thinking courses.
Pedagogical approaches to developing critical thinking in chemistry in higher education range from writing exercises (Oliver-Hoyo, 2003; Martineau and Boisvert, 2011; Stephenson and Sadler-Mcknight, 2016), inquiry-based projects (Gupta et al., 2015), flipped lectures (Flynn, 2011) and open-ended practicals (Klein and Carney, 2014) to gamification (Henderson, 2010) and work integrated learning (WIL) (Edwards et al., 2015). Researchers have demonstrated the benefits of developing critical thinking skills across the first, second and third years of an undergraduate degree (Phillips and Bond, 2004; Iwaoka et al., 2010). Phillips and Bond (2004) indicated that such interventions help develop a culture of inquiry and better prepare students for employment.
Some studies demonstrate the outcomes of teaching interventions via validated, commercially available critical thinking tests, available from a variety of vendors for a fee (Abrami et al., 2008; Tiruneh et al., 2014; Abrami et al., 2015; Carter et al., 2015). There are arguments against the generalisability of these commercially available tests. Many academics believe assessments need to closely align with the intervention(s) (Ennis, 1993), and a more accurate representation of student ability is obtained when a critical thinking assessment is related to a student's discipline, as students attach greater significance to the assessment (Halpern, 1998).
| Test | Question structure | Critical thinking skills assessed |
|---|---|---|
| California Critical Thinking Skills Test (CCTST) (Insight Assessment, 2013) | 40 multiple choice items | Analysis, evaluation, inference, deduction and induction |
| Watson–Glaser Critical Thinking Appraisal (WGCTA) (AssessmentDay Ltd, 2015) | 80 multiple choice items | Inference, deduction, drawing conclusions, making assumptions and assessing arguments |
| Watson–Glaser Critical Thinking Appraisal Short Form (WGCTA-S) (Pearson, 2015) | 40 multiple choice items | Inference, deduction, drawing conclusions, making assumptions and assessing arguments |
| Cornell Critical Thinking Test Level Z (CCTT-Z) (The Critical Thinking Co., 2017) | 52 multiple choice items | Induction, deduction, credibility, identification of assumptions, semantics, definition and prediction in planning experiments |
| Ennis–Weir Critical Thinking Essay Test (EWCTET) (Ennis and Weir, 1985) | Eight paragraphs, presented as letters containing errors in critical thinking, with an essay written in response to these paragraphs | Understanding the point, seeing reasons and assumptions, stating one's point, offering good reasons, seeing other possibilities, and responding appropriately and/or avoiding poor argument structure |
| Halpern Critical Thinking Assessment (HCTA) (Halpern, 2016) | 20 scenarios or passages followed by a combination of 25 multiple choice, ranking or rating alternatives and 25 short answer responses | Reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, decision making and problem solving |
Several reviews of empirical studies suggest that the WGCTA is the most prominent test in use (Behar-Horenstein and Niu, 2011; Carter et al., 2015; Huber and Kuncel, 2016). However, the CCTST was developed much later than the WGCTA and recent trends suggest the CCTST has gained popularity amongst researchers since its inception. Typically, the tests are administered to address questions regarding the development of critical thinking over time or the effect of a teaching intervention. The results of this testing are inconsistent; some studies report significant changes while others report no significant changes in critical thinking (Behar-Horenstein and Niu, 2011). For example, Carter et al. (2015) found studies which used the CCTST or the WGCTA did not all support the hypothesis of improved critical thinking with time, with some studies reporting increases, and some studies reporting decreases or no change over time. These reviews highlight the importance of experimental design when evaluating critical thinking. McMillan (1987) reviewed 27 studies and found that only seven of them demonstrated significant changes in critical thinking. He concluded that tests which were designed by the researcher are a better measure of critical thinking, as they specifically address the desired critical thinking learning outcomes, as opposed to commercially available tools which attempt to measure critical thinking as a broad and generalised construct.
Several examples of chemistry specific critical thinking tests and teaching tools were found in the literature. However, while all of these tests and teaching activities were set within a chemistry context, they required discipline specific knowledge and/or were not suitable for very large cohorts of students. For example, Jacob (2004) presented students with six questions, each consisting of a statement requiring an understanding of declarative chemical knowledge. Students were expected to select whether the conclusion was valid, possible or invalid and provide a short statement to explain their reasoning. Similarly, Kogut (1993) developed exercises in which students were required to note observations and underlying assumptions of chemical phenomena and then develop hypotheses, and experimental designs with which to test them. However, understanding the observations and underlying assumptions was dependent on declarative chemical knowledge such as trends in the periodic table or the ideal gas law.
Garratt et al. (1999) developed an entire book dedicated to developing chemistry critical thinking, titled ‘A Question of Chemistry’. In writing this book the authors took the view that thinking critically in chemistry draws on the generic skills of critical thinking and what they call ‘an ethos of a particular scientific method’ (Garratt et al., 2000, p. 153). The approach to delivering these questions ranged from basic multiple choice questions, to rearranging statements to generate a cohesive argument, or open-ended responses. The statements preceding the questions are very discipline specific and the authors acknowledge they are inaccessible to a lay person. Overall the chemistry context is used because ‘it adds to the students’ motivation if they can see the exercises are firmly rooted in, and therefore relevant to, their chosen discipline’ (Garratt et al., 2000, p. 166).
Thus, an opportunity has been identified to develop a chemistry critical thinking test which could be used to assist chemistry educators and chemistry education researchers in evaluating the effectiveness of teaching interventions designed to develop the critical thinking skills of chemistry undergraduate students. This study aimed to determine whether a valid and reliable critical thinking test could be developed and contextualised within the discipline of chemistry, yet independent of any discipline-specific knowledge, so as to accurately reflect the critical thinking ability of chemistry students from any level of study, at any university.
This study describes the development and reliability and validity testing of an instrument with which to measure undergraduate chemistry students’ critical thinking skills: The Danczak–Overton–Thompson Chemistry Critical Thinking Test (DOT test).
Fig. 1 Flow chart of methodology consisting of writing and evaluating the reliability and validity of iterations of the DOT test.
As critical thinking tests are considered to evaluate a psychometric construct (Nunnally and Bernstein, 1994) there must be supporting evidence of their reliability and validity (Kline, 2005; DeVellis, 2012). Changes made to each iteration of the DOT test and the qualitative and quantitative analysis performed at each stage of the study are described below.
The qualitative data for DOT V1 and DOT V2 were treated as separate studies. The data for these studies were collected and analysed separately as described in the following. The data collected from focus groups throughout this research were recorded with permission of the participants and transcribed verbatim into Microsoft Word, at which point participants were de-identified. The transcripts were then imported into NVivo version 11 and an initial analysis was performed to identify emergent themes. The data then underwent a second analysis to ensure any underlying themes were identified. A third review of the data used a redundancy approach to combine similar themes. The final themes were then used for subsequent coding of the transcripts (Bryman, 2008).
Therefore, students, teaching staff and employers did not define critical thinking in the holistic fashion of philosophers, cognitive psychologists or education researchers. In fact, very much in line with the constructivist paradigm (Ferguson, 2007), participants seem to have drawn on elements of critical thinking relative to the environments in which they had previously been required to use critical thinking. For example, students focused on analysis and problem solving, possibly due to the assessment driven environment of university, whereas employers cited innovation and global contexts, likely to be reflective of a commercial environment.
The definitions of critical thinking in the literature (Lehman et al., 1988; Facione, 1990; Halpern, 1996b) cover a wide range of skills and behaviours. These definitions often imply that to think critically necessitates that all of these skills or behaviours be demonstrated. However, it seems almost impossible that all of these attributes could be observed at a given point in time, let alone assessed (Dressel and Mayhew, 1954; Bailin, 2002). Whilst the students in Danczak et al. (2017) used ‘problem solving’ and ‘analysis’ to define critical thinking, it does not necessarily mean that their description accurately reflects the critical thinking skills they have actually acquired, but rather their perception of what critical thinking skills they have developed. Therefore, to base a chemistry critical thinking test solely on analysis and problem solving skills would lead to the omission of the assessment of other important aspects of critical thinking.
To this end, the operational definition of critical thinking acknowledges the analysis and problem solving focus that students predominantly used to describe critical thinking, whilst expanding into other important aspects of critical thinking such as inference and judgement. Consequently, guidance was sought from existing critical thinking assessments, as described below.
The WGCTA is an 85 item test which has undergone extensive scrutiny in the literature since its inception in the 1920s (Behar-Horenstein and Niu, 2011; Huber and Kuncel, 2016). The WGCTA covers the core principles of critical thinking divided into five sections: inference, assumption identification, deduction, interpreting information and evaluation of arguments. The questions test each aspect of critical thinking independent of context. Each section consists of brief instructions and three or four short parent statements. Each parent statement acts as a prompt for three to seven subsequent questions. The instructions, parent statements, and the questions themselves were concise with respect to language and reading requirements. The fact that the WGCTA focused on testing assumptions, deductions, inferences, analysing arguments and interpreting information was an inherent limitation in its ability to assess all critical thinking skills and behaviours. However, these elements are commonly described by many definitions of critical thinking (Facione, 1990; Halpern, 1996b).
The questions from the WGCTA practice test were analysed and used to structure the DOT test. The pilot version of the DOT test (DOT P) was initially developed with 85 questions, set within a chemistry or science context, and using similar structure and instructions to the WGCTA with five sub-scales: making assumptions, analysing arguments, developing hypotheses, testing hypotheses, and drawing conclusions.
Below is an example of paired statements and questions written for the DOT P. This question is revisited throughout this paper to illustrate the evolution of the test throughout the study. The DOT P used essentially the same instructions as provided on the WGCTA practice test. In later versions of the DOT Test the instructions were changed as will be discussed later.
The parent statement of the exemplar question from the WGCTA required the participant to recognise proposition A and proposition B are different, and explicitly states that there is a relationship between proposition A and proposition C. This format was used to generate the following parent statement:
A chemist tested a metal centred complex by placing it in a magnetic field. The complex was attracted to the magnetic field. From this result the chemist decided the complex had unpaired electrons and was therefore paramagnetic rather than diamagnetic.
In writing an assumption question for the DOT test, paramagnetic and diamagnetic behaviour of metal complexes replaced propositions A and B. The relationship between propositions A and C was replaced with paramagnetic behaviour being related to unpaired electrons. The question then asked if it is a valid or invalid assumption that proposition B is not related to proposition C.
Diamagnetic metal centred complexes do not have any unpaired electrons.
The correct answer was ‘valid assumption’, as this question required the participant to identify that propositions B and C were not related. The explanation for the correct answer was as follows:
The paragraph suggests that if the complex has unpaired electrons it is paramagnetic. This means diamagnetic complexes likely cannot have unpaired electrons.
All 85 questions on the WGCTA practice test were analysed in the manner exemplified above to develop the DOT P. In designing the test there were two requirements that had to be met. Firstly, the test needed to be able to be completed comfortably within 30 minutes to allow it to be administered in short time frames, such as at the end of laboratory sessions, and to increase the likelihood of voluntary completion by students. Secondly, the test needed to be able to accurately assess the critical thinking of chemistry students from any level of study, from first year general chemistry students to final year students. To this end, chemistry terminology was carefully chosen to ensure that prior knowledge of chemistry was not necessary to comprehend the questions. Chemical phenomena were explained and contextualised completely within the parent statement and the questions.
The test took in excess of 40 minutes to complete. Therefore, questions which were identified as unclear, which did not elicit the intended responses, or which caused misconceptions of the scientific content were removed. The resulting DOT V1 contained seven questions relating to ‘Making Assumptions’, seven questions relating to ‘Analysing Arguments’, six questions relating to ‘Developing Hypotheses’, five questions relating to ‘Testing Hypotheses’ and five questions relating to ‘Drawing Conclusions’. The terms used to select a multiple choice option were written in a manner more accessible to science students, for example using terms such as ‘Valid Assumption’ or ‘Invalid Assumption’ instead of ‘Assumption Made’ or ‘Assumption Not Made’. Finally, the number of options in the ‘Developing Hypotheses’ section was reduced from five to three: ‘likely to be an accurate inference’, ‘insufficient information to determine accuracy’ and ‘unlikely to be an accurate inference’.
All responses to the DOT test and WGCTA-S were imported into IBM SPSS Statistics (V22). Frequency tables were generated to identify erroneous or missing data. Data was considered erroneous when participants had selected ‘C’ for questions which only offered options ‘A’ or ‘B’, or when undergraduate students identified their education/occupation as that of an academic. Erroneous data was deleted and treated as missing data points. In each study a variable was created to record the number of unanswered questions (missing data) for each participant. Pallant (2016, pp. 58–59) suggests a judgement call is required when considering missing data and whether to treat particular cases as genuine attempts to complete the test. In the context of this study a genuine attempt was defined by the number of questions a participant left unanswered: participants who attempted at least 27 questions were considered to have genuinely attempted the test.
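As a rough illustration of this screening step, the sketch below (Python with pandas) flags out-of-range option codes as missing, counts unanswered questions per participant and keeps only attempts with at least 27 answered questions. The file name, column layout and the set of two-option questions are hypothetical, not taken from the study's data files.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per participant, columns Q1..Q30 holding
# option letters ('A', 'B', 'C', ...) or NaN for unanswered questions.
df = pd.read_csv("dot_responses.csv")  # hypothetical file name
question_cols = [c for c in df.columns if c.startswith("Q")]

# Hypothetical example: questions 1-10 only offer options 'A' and 'B';
# any other response is treated as erroneous and recoded as missing.
two_option_cols = [f"Q{i}" for i in range(1, 11)]
for col in two_option_cols:
    df.loc[~df[col].isin(["A", "B"]), col] = np.nan
# (The study also discarded responses from undergraduates who listed their
# occupation as 'academic'; that demographic check is omitted here.)

# Count unanswered questions per participant and keep genuine attempts,
# defined in the study as at least 27 answered questions.
df["n_answered"] = df[question_cols].notna().sum(axis=1)
genuine = df[df["n_answered"] >= 27].copy()
```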
Responses to all DOT test (and WGCTA-S) questions were coded as correct or incorrect. Descriptive statistics showed that the DOT V1 scores exhibited a normal (Gaussian) distribution, whereas the DOT V3 scores exhibited a non-normal distribution. In light of these distributions it was decided to treat all data obtained as non-parametric.
Internal reliability of each iteration of the DOT test was determined by calculating Cronbach's α (Cronbach, 1951). Within this study, comparisons between two continuous variables were made using Spearman's rank order correlation, the non-parametric equivalent of Pearson's r, as recommended by Pallant (2016). Continuous variables included DOT test scores and previous academic achievement as measured by tertiary entrance scores (ATAR). When comparing DOT test scores between education groups, which were treated as categorical variables, the Mann–Whitney U test, the non-parametric equivalent of a t-test, was used. When comparing a continuous variable measured on the same participants at different times, the Wilcoxon signed rank test, the non-parametric equivalent of a paired t-test, was used.
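For readers who wish to reproduce this style of analysis outside SPSS, the following sketch shows equivalent calculations in Python with scipy: a simple Cronbach's α over item-level scores, plus the Spearman, Mann–Whitney U and Wilcoxon procedures named above. The variable names and example arrays are illustrative only and are not the study's data.

```python
import numpy as np
from scipy import stats


def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items array of 0/1 item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)


# Illustrative data: 100 participants, 30 dichotomously scored items.
rng = np.random.default_rng(0)
item_scores = rng.integers(0, 2, size=(100, 30))
dot_scores = item_scores.sum(axis=1)
atar = rng.normal(88, 5, size=100)

alpha = cronbach_alpha(item_scores)

# Spearman's rank order correlation between two continuous variables.
rho, p_rho = stats.spearmanr(dot_scores, atar)

# Mann-Whitney U test comparing two independent education groups.
group_a, group_b = dot_scores[:50], dot_scores[50:]
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Wilcoxon signed rank test for the same participants tested twice.
retest = dot_scores + rng.integers(-2, 3, size=100)
w_stat, p_w = stats.wilcoxon(dot_scores, retest)
```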
The most extensive rewriting of the parent statements occurred in the ‘Analysing Arguments’ section. The feedback provided from the focus groups indicated that parent statements did not include sufficient information to adequately respond to the questions.
Additional qualifying statements were added to several questions in order to reduce ambiguity. In the parent statement of the exemplar question the first sentence was added to eliminate the need to understand that differences exist between diamagnetic and paramagnetic metal complexes, with respect to how they interact with magnetic fields:
Paramagnetic and diamagnetic metal complexes behave differently when exposed to a magnetic field. A chemist tested a metal complex by placing it in a magnetic field. From the result of the test the chemist decided the metal complex had unpaired electrons and was therefore paramagnetic.
Finally, great effort was made in the organisation of the DOT V2 to guide the test taker through a critical thinking process. The organisation follows Halpern's approach to analysing an argument (Halpern, 1996a): an argument comprises several conclusions, and the credibility of these conclusions must be evaluated. Furthermore, the validity of any assumptions, inferences and deductions used to construct the conclusions within an argument needs to be analysed. To this end the test taker was provided with scaffolding from making assumptions through to analysing arguments, in line with Halpern's approach.
On the first day, demographic data was collected: sex, dominant language, previous academic achievement using tertiary entrance scores (ATAR), level of chemistry being studied and highest level of chemistry study completed at Monash University. Students completed the DOT V2 using an optical reader multiple choice answer sheet. This was followed by completion of the WGCTA-S in line with procedures outlined by the Watson-Glaser critical thinking appraisal short form manual (2006). The WGCTA-S was chosen for analysis of convergent validity, as it was similar in length to the DOT V2 and was intended to measure the same aspects of critical thinking. The fact that participants completed the DOT V2 and then the WGCTA-S may have affected the participants’ performance on the WGCTA-S. This limitation will be addressed in the results.
After a brief break, the participants were divided into groups of five to eight students and interviewed about their overall impression of the WGCTA-S and their approach to various questions. Interviewers prevented the participants from discussing the DOT V2 so as not to influence each other's responses upon retesting.
On the second day participants repeated the DOT V2. DOT V2 attempts were completed on consecutive days to minimise participant attrition. Upon completion of the DOT V2 and after a short break, participants were divided into two groups of nine and interviewed about their impressions of the DOT V2, how they approached various questions and comparisons between the DOT V2 and WGCTA-S.
Responses to the tests and demographic data were imported into IBM SPSS (V22). Data was treated in accordance with the procedure outlined earlier. With the exception of tertiary entrance score, there was no missing or erroneous demographic data. Spearman's rank order correlations were performed comparing ATAR scores to scores on the WGCTA-S and the DOT V2. Test–retest reliability was determined using a Wilcoxon signed rank test (Pallant, 2016, pp. 234–236, 249–253). When the scores of the tests taken at different times show no significant difference, as determined by a p value greater than 0.05, the test can be considered to have acceptable test–retest reliability (Pallant, 2016, p. 235). Acceptable test–retest reliability does not imply that test attempts are equivalent; rather, it suggests that the precision with which the test measures the construct of interest is acceptable. Median scores of the participants' first attempt at the DOT V2 were compared with median scores of their second attempt. To determine the convergent validity of the DOT V2, the relationship between scores on the DOT V2 and performance on the WGCTA-S was investigated using Spearman's rank order correlation.
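A minimal sketch of these two checks, assuming first- and second-attempt DOT V2 scores and WGCTA-S scores are held in equal-length arrays (the names and values below are illustrative only):

```python
from scipy import stats

# DOT V2 scores for the same participants on consecutive days (illustrative).
attempt_1 = [18, 22, 15, 25, 20, 17, 23, 19, 21, 16]
attempt_2 = [19, 21, 15, 24, 20, 18, 23, 18, 22, 16]

# Test-retest reliability: a p value above 0.05 indicates no significant
# difference between attempts, taken here as acceptable test-retest reliability.
w_stat, p_retest = stats.wilcoxon(attempt_1, attempt_2)
acceptable_retest = p_retest > 0.05

# Convergent validity: Spearman correlation between DOT V2 and WGCTA-S scores.
wgcta_s = [20, 24, 14, 26, 19, 18, 25, 17, 22, 15]
rho, p_rho = stats.spearmanr(attempt_1, wgcta_s)
```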
After approximately 15 minutes of participants freely discussing the relevant test, the interviewers asked the participants to look at a given section on a test, for example the ‘Testing Hypotheses’ section of the DOT V2, and identify any questions they found problematic. In the absence of students identifying any problematic questions, the interviewers used a list of questions from each test to prompt discussion. The participants were then asked as a group:
• ‘What do you think the question is asking you?’
• ‘What do you think is the important information in this question?’
• ‘Why did you give the answer(s) you did to this question?’
The interview recordings were transcribed and analysed in line with the procedures and theoretical frameworks described previously to result in four distinct themes which were used to code the transcripts.
Many scientific terms were either simplified or removed in the DOT V3. In the case of the exemplar question, the focus was moved to an alloy of thallium and lead rather than a ‘metal complex’. Generalising this question to focus on an alloy allowed these questions to retain scientific accuracy and reduce the tendency for participants to draw on knowledge outside the information presented in the questions:
Metals which are paramagnetic or diamagnetic behave differently when exposed to an induced magnetic field. A chemist tested a metallic alloy sample containing thallium and lead by placing it in an induced magnetic field. From the test results the chemist decided the metallic alloy sample repelled the induced magnetic field and therefore was diamagnetic.
This statement was then followed by the prompt asking the participant to decide if the assumption presented was valid or invalid:
Paramagnetic metals do not repel induced magnetic fields.
Several terms were rewritten as their use in science implied assumptions as identified by the student focus groups. These assumptions were not intended and hence the questions were reworded. For example, question 14 asked whether a ‘low yield’ would occur in a given synthetic route. The term ‘low yield’ was changed to ‘an insignificant amount’ to remove any assumptions regarding the term ‘yield’.
The study of the DOT V3 required participants to be drawn from several distinct groups in order to assess criterion and discriminant validity. For the purpose of criterion validity, the DOT V3 was administered to first year and third year undergraduate chemistry students, honours and PhD students and post-doctoral researchers at Monash University, and chemistry education academics from an online community of practice. Furthermore, third year undergraduate chemistry students from another Australian higher education institution (Curtin University) also completed the DOT V3 to determine discriminant validity with respect to performance on the DOT V3 outside of Monash University.
Third year participants were drawn from an advanced inorganic chemistry course at Monash University and a capstone chemical engineering course at Curtin University. 54 students (37%) responded to the DOT test at Monash University. The 23 students who completed the DOT V3 at Curtin University represented the entire cohort.
Post-doctoral researchers, honours and PhD students from Monash University were invited to attempt the DOT V3. 40 participants drawn from these cohorts attended a session where they completed the DOT V3 in a paper format, marking responses directly onto the test. All cohorts who completed the test in paper format required approximately 20 to 30 minutes to complete the test.
Members of an online discussion group of approximately 300 chemistry academics with an interest in education, predominantly from Australia, the UK and Europe, were invited to complete an online version of the DOT V3. Online completion was untimed and 46 participants completed the DOT V3.
Descriptive statistics of the 270 DOT V3 results revealed that a larger proportion of scores were above the mean, thus the data was considered non-parametric for the purposes of reliability and validity statistical analysis. Internal consistency was then determined by calculating Cronbach's α (Cronbach, 1951). The five sub-scales of the DOT V3 (Making Assumptions, Developing Hypotheses, Testing Hypotheses, Drawing Conclusions and Analysing Arguments) underwent a principal component analysis to determine the number of factors underlying the DOT V3.
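The subscale-level factor check could be approximated as in the sketch below (Python with scikit-learn). The participants-by-subscale score matrix is simulated here, and the retention rule shown (eigenvalues greater than 1, the Kaiser criterion) is a common convention rather than a criterion stated in the paper; the original analysis was run in SPSS.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated 270 participants x 5 subscale scores (Making Assumptions,
# Developing Hypotheses, Testing Hypotheses, Drawing Conclusions,
# Analysing Arguments); not the study's data.
rng = np.random.default_rng(1)
shared_factor = rng.normal(0, 1, size=(270, 1))
subscales = shared_factor + rng.normal(0, 0.7, size=(270, 5))

# Standardise so the eigenvalues correspond to the correlation matrix.
z = (subscales - subscales.mean(axis=0)) / subscales.std(axis=0, ddof=1)

pca = PCA().fit(z)
eigenvalues = pca.explained_variance_

# Components with an eigenvalue above 1 are commonly retained (Kaiser criterion).
n_factors = int((eigenvalues > 1).sum())
print(eigenvalues, n_factors)
```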
The academic participants in the focus groups navigated the questions on the DOT V1 to arrive at the intended responses. However, there was rarely consensus within the group and a minority, usually one or two participants, disagreed with the group. The difficulties the academics had in responding to the DOT V1 were made clear from four themes which emerged from the analysis: ‘Instruction Clarity’, ‘Wording of the Question(s)’, ‘Information within the Statement’ and ‘Prior Knowledge’ (Table 2). The last theme was generally found to be associated with the other themes.
| Theme (theme representation %) | Description | Example |
|---|---|---|
| Instruction clarity (15%) | Insufficient information within the instructions (sometimes due to superficial reading) | “…what do you mean by ‘is of significant importance’?” |
| Wording of the question (41%) | Attributing meaning to specific words within a question | “…the language gives it away cause you say ‘rather than’.” |
| Information within the parent statement (15%) | Adequate or insufficient information in the parent statement to respond to the questions | “Basically what you’re providing in the preamble is the definition, right?” |
| Prior knowledge (30%) | Using, or requiring the use of, prior scientific knowledge from outside the test | “I didn’t think there was any information about the negative charge… I didn’t know what that meant, so I tried to go on the text.” |
The theme of ‘Instruction Clarity’ was used to describe when participants either had difficulty interpreting the instructions or intentionally ignored the instructions. Several participants self-reported their tendency to only scan the instructions without properly reading them, or did not read the statement preceding a question in its entirety. When this behaviour occurred, academics were quick to draw on outside knowledge. This theme identified the need for clarity of the instructions and providing relevant examples of what was meant by terms such as ‘is of significant importance’ or ‘is not of significant importance’.
The theme of ‘Wording of the Questions’ referred to evaluating the meaning and use of particular words within the questions or the parent statements. The wording of several questions led to confusion, causing the participants to draw on outside knowledge. Unclear terminology hindered non-science participants (education developers) from attempting the questions, and was further compounded by the use of terms such as ‘can only ever be’. For example, the use of the term ‘rather than’ confused participants when they knew a question had more than two alternatives.
The theme ‘Information within the Statement’ referred to the participants’ perceptions of the quality and depth of information provided in the parent statements. Participants suggested some test questions appeared to be non-sequiturs with respect to the corresponding parent statements. Participants felt they did not have enough information to make a decision, and the lack of clarity in the instructions further compounded the problem.
The theme ‘Prior Knowledge’ identified instances when participants had drawn on information not provided in the DOT V1 to answer the questions. Several issues regarding prior knowledge emerged from the discourse. Participants identified that there were some assumptions made about the use of the chemical notations. Finally some participants highlighted that having prior knowledge, specifically in science and/or chemistry, was to their detriment when attempting the questions.
A total of 15 participants provided their tertiary entrance score (ATAR), as a measure of previous academic achievement. There is some discussion in the literature which suggests university entrance scores obtained in high school do not reflect intelligence and cognitive ability (Richardson, Abraham and Bond, 2012). However, a comparison of previous academic achievement, reported via ATAR scores, revealed a small positive correlation with scores obtained on the DOT V2 (ρ = 0.23) and a moderately positive correlation with scores obtained on the WGCTA-S (ρ = 0.47).
“The second time it felt like I was just remembering what I put down the day before.”
The WGCTA-S manual (Watson and Glaser, 2006, pp. 30–31) lists three studies with test–retest intervals of three months, two weeks or four days. These studies reported test–retest correlations ranging from 0.73 to 0.89, with the stronger correlations associated with shorter intervals between tests. However, as the p value of the Wilcoxon signed rank test in the present study was sufficiently large (0.91), it was unlikely that the DOT V2 would have exhibited poor test–retest reliability had it been administered over a longer time interval.
“I found the questions (on the DOT V2) a bit more interesting and engaging in general where as this one (WGCTA-S) seemed a bit more clinical.”
However, two participants did express their preference for the WGCTA-S citing the detailed examples in the instructions of each section, and their frustration when attempting the DOT V2, requiring them to recognise whether they were drawing on chemistry knowledge outside of the question.
The qualitative analysis of the student focus group data provided useful insight regarding the content validity of the DOT V2. When discussing their responses, the participants often arrived at a group consensus on the correct answers for both the DOT V2 and the WGCTA-S. Rarely did the participants initially arrive at a unanimous decision. In several instances on both tests, there were as many participants in favour of the incorrect response as there were participants in favour of the correct response. Four themes emerged from the analysis of the transcripts which are presented in Table 3.
| Theme (theme representation %) | Description | Example |
|---|---|---|
| Strategies used to attempt test questions (46%) | Approaches participants took, including dependence on examples, evaluating key words, and construction of rules or hypothetical scenarios | “…you could come back to it and then look at how each of the example questions were answered…” |
| Difficulties associated with prior knowledge (21%) | Participants were consciously aware of their prior knowledge, either attempting to restrict its use or finding it in conflict with their response to a given question | “It's quite difficult to leave previous knowledge and experience off when you’re trying to approach these (questions).” |
| Terms used to articulate cognitive processes (22%) | Evidence of critical thinking and critical thinking terminology the participants were exposed to throughout the focus groups, in particular ‘bias’ | “I think like the first section…was more difficult than the other because I think I had more bias in that question.” |
| Evidence of peer learning (11%) | Discourse between participants in which new insight was gained regarding how to approach test questions | “To me, the fact that you know it starts talking about…fall outside of the information and is therefore an invalid assumption.” |
The theme ‘Strategies used to attempt test questions’ describes both the participants’ overall practice and increasing familiarity with the style of questions, and also the specific cognitive techniques used in attempting to answer questions. The approach participants used when performing these tests was reflective of the fact they became more familiar with the style of questions and their dependence on the examples provided diminished.
Some participants had difficulty understanding what was meant by ‘Assumption Made’ and ‘Assumption Not Made’ in the ‘Recognition of Assumption’ section in the WGCTA-S and drew heavily on the worked examples provided in the introduction to the section. At the conclusion of this study, these participants had completed three critical thinking tests and were becoming familiar with how the questions were asked and what was considered a correct response. However, test–retesting with the DOT V2 indicated that there was no change in performance.
There was concern that providing detailed instructions on the DOT test may in fact develop the participants’ critical thinking skills in the process of attempting to measure them. For example, Heijltjes et al. (2015) conducted a study with 152 undergraduate economics students, divided into six approximately equal groups, and found that participants who were exposed to written instructions performed on average 50% better on the critical thinking skills test than those who did not receive written instructions. It seems plausible that a similar result would occur with the DOT test, and evaluating the impact of instructions and examples using control and test groups would be beneficial in future studies of the DOT test.
The second aspect of this theme was the application of problem solving skills and the generation of hypothetical scenarios whereby deductive logic could be applied. The following quote described an example of a participant explicitly categorising the information they were provided with in the parent statements and systematically analysing those relationships to answer the questions.
“I find that with (section) three, deduction, that it was really easy to think in terms of sets, it was easier to think in terms set rather than words, doodling Venn diagrams trying to solve these ones.”
The Delphi report considers behaviours described by this theme to be part of the interpretation critical thinking skill which describes the ability ‘to detect … relationships’ or ‘to paraphrase or make explicit … conventional or intended meaning’ of a variety of stimuli (Facione, 1990, p. 8). Others consider this behaviour to be more reflective of problem solving skills, describing the behaviour as ‘understanding of the information given’ in order to build a mental representation of the problem (OECD, 2014, p. 31). These patterns of problem solving were most evident in the discussions in response to the WGCTA-S questions.
With respect to the DOT V2, participants had difficulty understanding the intended meaning of the questions without drawing on the previous knowledge of chemistry and science. For example, there were unexpected discussions of the implication of the term ‘lower yield’ in a chemistry context and the relationship to a reaction failing. Participants pointed out underlying assumptions associated with terms such as ‘yield’ highlighting that the term ‘yield’ was not necessarily reflective of obtaining a desired product.
The theme ‘Difficulties associated with prior knowledge’ described when participants drew on knowledge from outside the test in efforts to respond to the questions. In both the WGCTA-S and the DOT V2, the instructions clearly stated to only use the information provided within the parent statements and the questions. These difficulties were most prevalent when participants described their experiences with the DOT V2. For example, the participants were asked to determine the validity of a statement regarding the relationship between the formal charge of anions and how readily anions accept hydrogen ions. In arriving at their answer, one student drew on their outside knowledge of large molecules such as proteins to suggest:
“What if you had some ridiculous molecule that has like a 3 minus charge but the negative zones are all the way inside the molecule then it would actually accept the H plus?”
While this student's hypothesis led them to decide that the assumption was invalid, which was the intended response, the intended approach of this question was to recognise that the parent statement made no reference to how strongly cations and anions are attracted to each other.
It was concerning that some participants felt they had to ‘un-train’ themselves of their chemical knowledge in order to properly engage with the DOT V2. Some participants highlighted that they found the WGCTA-S easier as they did not have to reflect on whether they were using their prior knowledge. However, many participants were asking themselves ‘why am I thinking what I’m thinking?’, which is indicative of the higher order metacognitive skills described by several critical thinking theoreticians (Facione, 1990, p. 10; Kuhn, 2000; Tsai, 2001). Students appeared to be questioning whether their responses to the DOT V2 were based on their own pre-existing knowledge or on the information presented within the test, as highlighted in the following statement.
“You had to think more oh am I using my own knowledge or what's just in the question? I was like so what is assumed to be background knowledge. What's background knowledge?”
The theme ‘Terms used to articulate cognitive processes’ described the participants applying the language from the instructions of the WGCTA-S and the DOT V2 to articulate their thought processes. In particular, participants were very aware of their prior knowledge, referring to this as ‘bias’.
In response to the questions in the ‘Developing Hypotheses’ section, which related to the probability of failure of an esterification reaction, one student identified that they attempted to view the questions from the perspective of an individual with limited scientific knowledge in order to minimise the influence of their prior chemical knowledge on their responses. There was much discussion of what was meant by the term ‘failure’ in the context of a chemical reaction and whether failure referred to unsuccessful collisions at the molecular level or the absence of a product at the macroscopic level.
The students engaged in dialogues which helped refine the language they used in articulating their thoughts or helped them recognise thinking errors. This describes the final emergent theme of ‘Evidence of peer learning’. For example, when discussing their thought processes regarding a question in the ‘Deduction’ section of the WGCTA-S, one student shared their strategy of constructing mental Venn diagrams and had correctly identified how elements of the question related. This prompted other students to recognise a connection they had initially failed to make and to reconsider their responses.
| Education group | Mean ATAR | Female | Male |
|---|---|---|---|
| First year | 87.10 (n = 104) | 54% (n = 64) | 43% (n = 51) |
| Third year | 90.03 (n = 55) | 39% (n = 26) | 61% (n = 41) |
| Postgraduates | 87.35 (n = 19) | 43% (n = 19) | 57% (n = 25) |
| Academics | 89.58 (n = 3) | 38% (n = 15) | 58% (n = 23) |
| Overall | 88.23 (n = 181) | 46% (n = 124) | 52% (n = 140) |
| Education group | 1st year | 3rd year | P’grad | Academic |
|---|---|---|---|---|
| 1st year (n = 119, Md = 16) | | p < 0.001, r = 0.39 | p < 0.001, r = 0.64 | p < 0.001, r = 0.59 |
| 3rd year (n = 67, Md = 21) | p < 0.001, r = 0.39 | | p < 0.001, r = 0.30 | p = 0.003, r = 0.28 |
| Postgraduates (n = 44, Md = 23.5) | p < 0.001, r = 0.53 | p < 0.001, r = 0.30 | | p = 0.691, r = 0.04 |
| Academic (n = 40, Md = 24) | p < 0.001, r = 0.59 | p = 0.003, r = 0.28 | p = 0.691, r = 0.04 | |
Interestingly, there appeared to be no statistically significant difference in DOT V3 scores when comparing postgraduates and academics. If the assumption that critical thinking skill is correlated positively to time spent in tertiary education environments is valid, it is likely that the DOT V3 was not sensitive enough to detect any difference in critical thinking skill between postgraduates and academics.
| Education group | Female | Male | Significance |
|---|---|---|---|
| 1st year | n = 64, Md = 15 | n = 51, Md = 18 | p = 0.007, r = 0.25 |
| 3rd year | n = 26, Md = 20.5 | n = 41, Md = 21 | p = 0.228, r = 0.15 |
| Postgraduate | n = 19, Md = 24 | n = 25, Md = 23 | p = 0.896, r = 0.02 |
| Academic | n = 15, Md = 24 | n = 23, Md = 22 | p = 0.904, r = 0.02 |
Using Spearman's rank-order correlation coefficient, there was a weak, positive correlation between DOT V3 score and ATAR score (ρ = 0.20, n = 194, p = 0.01), suggesting that performance on the DOT V3 was only weakly dependent on previous academic achievement. This was in line with observations collected during testing of the DOT V2, where a small correlation (ρ = 0.23) was found between previous academic achievement and test performance, albeit from a small sample (n = 15). As the sample size in the DOT V3 study was much larger (n = 194), these findings suggest that performance on the DOT V3 was only slightly correlated with previous academic achievement.
In order to determine the validity of the DOT V3 outside of Monash University the median scores of third year students from Monash University and Curtin University were compared using a Mann–Whitney U Test. The test revealed no significant difference in the score obtained by Monash University students (Md = 20, n = 44) and Curtin University students (Md = 22, n = 24), U = 670.500, z = 1.835, p = 0.07, r = 0.22. Therefore, the score obtained on the DOT V3 was considered independent of where the participant attended university. It is possible that an insufficient number of tests were completed, due to the opportunistic sampling from both universities, and obtaining equivalent sample sizes across several higher education institutions would confirm whether the DOT V3 performs well across higher education institutions.
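The effect size reported alongside the Mann–Whitney U test is commonly derived from the normal approximation of U, with r = z/√N (the approach described by Pallant, and consistent with the r = 0.22 reported above, since 1.835/√68 ≈ 0.22). A sketch of that calculation is below; the score arrays are illustrative, not the Monash and Curtin data.

```python
import numpy as np
from scipy import stats

monash = np.array([20, 18, 22, 19, 21, 23, 17, 20, 24, 18])  # illustrative
curtin = np.array([22, 21, 23, 20, 24, 19, 22, 25, 21, 23])  # illustrative

u, p = stats.mannwhitneyu(monash, curtin, alternative="two-sided")

# Normal approximation of U (tie correction ignored for simplicity).
n1, n2 = len(monash), len(curtin)
mu_u = n1 * n2 / 2
sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu_u) / sigma_u

# Effect size r = z / sqrt(total N), as recommended by Pallant (2016).
r = abs(z) / np.sqrt(n1 + n2)
```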
The DOT V3 offers a tool with which to measure a student's critical thinking skills and the effect of any teaching interventions specifically targeting the development of critical thinking. The test is suitable for studying the development of critical thinking using a cross section of students, and may be useful in longitudinal studies of a single cohort. In summary, research conducted within this study provides a body of evidence regarding reliability and validity of the DOT test, and it offers the chemistry education community a valuable research and educational tool with respect to the development of undergraduate chemistry students’ critical thinking skills. The DOT V3 is included in Appendix 1 (ESI†) (Administrator guidelines and intended responses to the DOT V3 can be obtained upon email correspondence with the first author).
Whilst there is the potential to measure the development of critical thinking over a semester using the DOT V3, there is evidence to suggest that a psychological construct such as critical thinking does not develop enough for measurable differences to occur in the space of a single semester (Pascarella, 1999). While the DOT V3 could be administered to the same cohort of students annually to form the basis of a longitudinal study, there are many hurdles to overcome in such a study, including participant retention and participants' developing familiarity with the test. Much like the CCTST and the WGCTA pre- and post-testing (Jacobs, 1999; Carter et al., 2015), at least two versions of the DOT test may be required for pre- and post-testing and for longitudinal studies. However, having a larger pool of questions does not prevent participants from becoming familiar with the style of critical thinking questions. Development of an additional test would require further reliability and validity testing (Nunnally and Bernstein, 1994). Nevertheless, cross-sectional studies are useful in identifying changes in critical thinking skills, and the DOT V3 has demonstrated that it is sensitive enough to discern between the critical thinking skills of first year and third year undergraduate chemistry students.
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8rp00130h