Development and validation of an instrument to measure undergraduate chemistry students’ critical thinking skills

Stephen M. Danczak *, Christopher D. Thompson and Tina L. Overton
School of Chemistry, Monash University, Victoria 3800, Australia. E-mail:

Received 18th May 2018, Accepted 28th June 2019

First published on 12th July 2019

The importance of developing and assessing student critical thinking at university can be seen through its inclusion as a graduate attribute for universities and from research highlighting the value employers, educators and students place on demonstrating critical thinking skills. Critical thinking skills are seldom explicitly assessed at universities. Commercial critical thinking assessments, which are often generic in context, are available. However, literature suggests that assessments that use a context relevant to the students more accurately reflect their critical thinking skills. This paper describes the development and evaluation of a chemistry critical thinking test (the Danczak–Overton–Thompson Chemistry Critical Thinking Test or DOT test), set in a chemistry context, and designed to be administered to undergraduate chemistry students at any level of study. Development and evaluation occurred over three versions of the DOT test through a variety of quantitative and qualitative reliability and validity testing phases. The studies suggest that the final version of the DOT test has good internal reliability, strong test–retest reliability, moderate convergent validity relative to a commercially available test and is independent of previous academic achievement and university of study. Criterion validity testing revealed that third year students performed statistically significantly better on the DOT test relative to first year students, and postgraduates and academics performed statistically significantly better than third year students. The statistical and qualitative analysis indicates that the DOT test is a suitable instrument for the chemistry education community to use to measure the development of undergraduate chemistry students’ critical thinking skills.


The term ‘critical thinking’ or expressions referring to critical thinking skills and behaviours such as ‘analyse and interpret data meaningfully’ can be found listed in the graduate attributes of many universities around the world (Monash University, 2015; University of Adelaide, 2015; University of Melbourne, 2015; Ontario University, 2017; University of Edinburgh, 2017). Many studies highlight motivations for the development of higher education graduates’ critical thinking skills from the perspective of students and employers. When 1065 Australian employers representing a range of industries were surveyed it was found that employers considered critical thinking to be the second most important skill or attribute behind active learning, and that over 80% of respondents indicated critical thinking as ‘important’ or ‘very important’ as a skill or attribute in the workplace (Prinsley and Baranyai, 2015). International studies have revealed similar results (Jackson, 2010; Desai et al., 2016). These findings are indicative of the persistent needs of the job market for effective new graduates and the expectations that graduates are able to demonstrate skills such as critical thinking (Lowden et al., 2011).

A survey of 167 recent science graduates compared the development of a variety of skills at university to the skills used in the work place (Sarkar et al., 2016). It found that 30% of graduates in full-time positions identified critical thinking as one of the top five skills they would like to have developed further within their undergraduate studies. Students, governments and employers all recognise that not only is developing students’ critical thinking an intrinsic good, but that it better prepares them to meet and exceed employer expectations when making decisions, solving problems and reflecting on their own performance (Lindsay, 2015). Hence, it has become somewhat of an expectation from governments, employers and students that it is the responsibility of higher education providers to develop students’ critical thinking skills. Yet, despite the clear need to develop these skills, measuring student attainment of critical thinking is challenging.

The definition of critical thinking

Three disciplines dominate the discussion around the definition of critical thinking: philosophy, cognitive psychology and education research. Among philosophers, one of the most commonly cited definitions of critical thinking is drawn from the Delphi Report which defines critical thinking as ‘purposeful, self-regulatory judgement which results in interpretation, analysis, evaluation, and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgement is based’ (Facione, 1990, p. 2). Despite being developed over 25 years ago this report is still relevant and the definition provided is still commonly used in recent literature (Abrami et al., 2015; Desai et al., 2016; Stephenson and Sadler-Mcknight, 2016).

Cognitive psychologists and education researchers use the term critical thinking to describe a set of cognitive skills, strategies or behaviours that increase the likelihood of a desired outcome (Halpern, 1996b; Tiruneh et al., 2014). Psychologists typically investigate critical thinking experimentally and have developed a series of reasoning schemas with which to study and define critical thinking; conditional reasoning, statistical reasoning, methodological reasoning and verbal reasoning (Nisbett et al., 1987; Lehman and Nisbett, 1990). Halpern (1993) expanded on these schemas to define critical thinking as the thinking required to solve problems, formulate inferences, calculate likelihoods and make decisions.

In education research there is often an emphasis on critical thinking as a skill set (Bailin, 2002) or putting critical thought into tangible action (Barnett, 1997). Dressel and Mayhew (1954) suggested it is educationally useful to define critical thinking as the sum of specific behaviours which could be observed from student acts. They identify these critical thinking abilities as identifying central issues, recognising underlying assumptions, evaluating evidence or authority, and drawing warranted conclusions. Bailin (2002) raises the point that from a pedagogical perspective many of the skills or dispositions commonly used to define critical thinking are difficult to observe and, therefore, difficult to assess. Consequently, Bailin suggests that the concept of critical thinking should explicitly focus on adherence to criteria and standards to reflect ‘good’ critical thinking (Bailin, 2002, p. 368).

It appears that there are several equally valuable definitions of critical thinking (Moore, 2013). There is agreement across much of the field that meta-cognitive skills, such as self-evaluation, are essential to a well-rounded process of critical thinking (Glaser, 1984; Kuhn, 1999; Pithers and Soden, 2000). Key themes such as ‘critical thinking: as judgement, as scepticism, as originality, as sensitive reading, or as rationality’ can be identified across the literature. In the context of developing an individual's critical thinking it is important that these themes take the form of observable behaviours.

Developing students’ critical thinking

There are two extreme views regarding the teaching of critical thinking and the role subject-specific knowledge plays in its development: the subject specifist view and the subject generalist view. The subject specifist view, championed by McPeak (1981), states that thinking is never without context and thus courses designed to teach informal logic in an abstract environment provide no benefit to a student's capacity to think critically (McPeak, 1990). This perspective is supported by the work of prominent psychologists in the early 20th century (Thorndike and Woodworth, 1901a, 1901b, 1901c; Inhelder and Piaget, 1958).

In the latter half of the 20th century informal logic gained academic credence as it challenged the previous ideas of logic being related purely to deduction or inference and recognised that there were, in fact, theories of argumentation and logical fallacies (Johnson et al., 1996). These theories began to be taught at universities as standalone courses free from any context in efforts to teach the structure of arguments and recognition of fallacies using abstract theories and symbolism. Cognitive psychology research lent evidence to the argument that critical thinking could be developed within a specific discipline and that those reasoning skills were, at least to some degree, transferable to situations encountered in daily life (Lehman et al., 1988; Lehman and Nisbett, 1990). These perspectives form the basis of the subject generalist view, which holds that critical thinking can be developed independently of subject-specific knowledge.

McMillan (1987) carried out a review of 27 empirical studies conducted at higher education institutions where critical thinking was taught, either in standalone courses or integrated into discipline-specific courses such as science. The review found that standalone and integrated courses were equally successful in developing critical thinking, provided critical thinking developmental goals were made explicit to the students. The review also suggested that the development of critical thinking was most effective when its principles were taught across a variety of discipline areas so as to make knowledge retrieval easier.

Ennis (1989) suggested that there are a range of approaches through which critical thinking can be taught: general, where critical thinking is taught separate from content or ‘discipline’; infusion, where the subject matter is covered in great depth and teaching of critical thinking is explicit; immersion, where the subject matter is covered in great depth but critical thinking goals are implicit; and mixed, a combination of the general approach with either the infusion or immersion approach. Ennis (1990) arrived at a pragmatic view to concede that the best critical thinking occurs within one's area of expertise, or domain specificity, but that critical thinking can still be effectively developed with or without discipline specific knowledge (McMillan, 1987; Ennis, 1990).

Many scholars remain entrenched in the debate regarding the role discipline-specific knowledge plays in the development of critical thinking. For example, Moore (2011) rejected the use of critical thinking as a catch-all term to describe a range of cognitive skills, believing that to teach critical thinking as a set of generalisable skills is insufficient to provide students with an adequate foundation for the breadth of problems they will encounter throughout their studies. Conversely, Davies (2013) accepts that critical thinking skills share fundamentals at the basis of all disciplines and that there can be a need to accommodate discipline-specific needs ‘higher up’ in tertiary education via the infusion approach. However, Davies considers the specifist approach to developing critical thinking ‘dangerous and wrong-headed’ (Davies, 2013, p. 543), citing government reports and primary literature which demonstrate tertiary students’ inability to identify elements of arguments, and championing the need for standalone critical thinking courses.

Pedagogical approaches to developing critical thinking in chemistry in higher education range from writing exercises (Oliver-Hoyo, 2003; Martineau and Boisvert, 2011; Stephenson and Sadler-Mcknight, 2016), inquiry-based projects (Gupta et al., 2015), flipped lectures (Flynn, 2011) and open-ended practicals (Klein and Carney, 2014) to gamification (Henderson, 2010) and work-integrated learning (WIL) (Edwards et al., 2015). Researchers have demonstrated the benefits of developing critical thinking skills across the first, second and third year programs of an undergraduate degree (Phillips and Bond, 2004; Iwaoka et al., 2010). Phillips and Bond (2004) indicated that such interventions help develop a culture of inquiry, and better prepare students for employment.

Some studies demonstrate the outcomes of teaching interventions via validated, commercially available critical thinking tests, available from a variety of vendors for a fee (Abrami et al., 2008; Tiruneh et al., 2014; Abrami et al., 2015; Carter et al., 2015). There are arguments against the generalisability of these commercially available tests. Many academics believe assessments need to closely align with the intervention(s) (Ennis, 1993), and a more accurate representation of student ability is obtained when a critical thinking assessment is related to a student's discipline, as students attach greater significance to the assessment (Halpern, 1998).

Review of commercial critical thinking assessment tools

A summary of the most commonly used commercial critical thinking skills tests, the style of questions used in each test and the critical thinking skills these tests claim to assess can be found in Table 1. For the purposes of this research the discussion will focus primarily on the tests and teaching tools used within the higher education setting. Whilst this list may not be exhaustive, it highlights those most commonly reported in the literature. The tests described are the California Critical Thinking Skills Test (CCTST) (Insight Assessment, 2013), the Watson-Glaser Critical Thinking Appraisal (WGCTA) (AssessmentDay Ltd, 2015), the Watson-Glaser Critical Thinking Appraisal Short Form (WGCTA-S) (Pearson, 2015), the Cornell Critical Thinking Test Level Z (CCTT-Z) (The Critical Thinking Co., 2017), the Ennis–Weir Critical Thinking Essay Test (EWCTET) (Ennis and Weir, 1985), and the Halpern Critical Thinking Assessment (HCTA) (Halpern, 2016). All of the tests use a generic context and participants require no discipline-specific knowledge in order to make a reasonable attempt at the tests. Each test is accompanied by a manual containing specific instructions, norms, validity, reliability and item analysis. The CCTST, WGCTA and WGCTA-S are available as two versions (often denoted version A and B) to facilitate pre- and post-testing. The tests are generally untimed, with the exception of the CCTT-Z and the EWCTET.
Table 1 Summary of commonly used commercially available critical thinking tests

| Test | Question structure | Critical thinking skills assessed |
| California Critical Thinking Skills Test (CCTST) (Insight Assessment, 2013) | 40 multiple choice questions | Analysis, evaluation, inference, deduction and induction |
| Watson-Glaser Critical Thinking Appraisal (WGCTA) (AssessmentDay Ltd, 2015) | 80 multiple choice questions | Inference, deduction, drawing conclusions, making assumptions and assessing arguments |
| Watson-Glaser Critical Thinking Appraisal Short Form (WGCTA-S) (Pearson, 2015) | 40 multiple choice questions | Inference, deduction, drawing conclusions, making assumptions and assessing arguments |
| Cornell Critical Thinking Test Level Z (CCTT-Z) (The Critical Thinking Co., 2017) | 52 multiple choice questions | Induction, deduction, credibility, identification of assumptions, semantics, definition and prediction in planning experiments |
| Ennis–Weir Critical Thinking Essay Test (EWCTET) (Ennis and Weir, 1985) | An eight-paragraph letter containing errors in critical thinking; participants write an essay in response | Understanding the point, seeing reasons and assumptions, stating one's point, offering good reasons, seeing other possibilities and responding appropriately and/or avoiding poor argument structure |
| Halpern Critical Thinking Assessment (HCTA) (Halpern, 2016) | 20 scenarios or passages followed by a combination of 25 multiple choice, ranking or rating alternatives and 25 short answer responses | Reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, decision making and problem solving |

Several reviews of empirical studies suggest that the WGCTA is the most prominent test in use (Behar-Horenstein and Niu, 2011; Carter et al., 2015; Huber and Kuncel, 2016). However, the CCTST was developed much later than the WGCTA and recent trends suggest the CCTST has gained popularity amongst researchers since its inception. Typically, the tests are administered to address questions regarding the development of critical thinking over time or the effect of a teaching intervention. The results of this testing are inconsistent; some studies report significant changes while others report no significant changes in critical thinking (Behar-Horenstein and Niu, 2011). For example, Carter et al. (2015) found studies which used the CCTST or the WGCTA did not all support the hypothesis of improved critical thinking with time, with some studies reporting increases, and some studies reporting decreases or no change over time. These reviews highlight the importance of experimental design when evaluating critical thinking. McMillan (1987) reviewed 27 studies and found that only seven of them demonstrated significant changes in critical thinking. He concluded that tests which were designed by the researcher are a better measure of critical thinking, as they specifically address the desired critical thinking learning outcomes, as opposed to commercially available tools which attempt to measure critical thinking as a broad and generalised construct.

The need for a contextualised chemistry critical thinking test

The desire to develop the critical thinking skills of students at higher education institutions has led to the design and implementation of a breadth of teaching interventions, and the development of a range of methods of assessing the impact of these interventions. Many of these assessment methods utilise validated, commercially available tests. However, there is evidence to suggest that if these assessments are to be used with tertiary chemistry students, the context of the assessments should be in the field of chemistry so that the students may better engage with the assessment in a familiar context (McMillan, 1987; Ennis, 1993; Halpern, 1998), and consequently, students’ performance on a critical thinking test may better reflect their actual critical thinking abilities.

Several examples of chemistry-specific critical thinking tests and teaching tools were found in the literature. However, while all of these tests and teaching activities were set within a chemistry context, they required discipline-specific knowledge and/or were not suitable for very large cohorts of students. For example, Jacob (2004) presented students with six questions, each consisting of a statement requiring an understanding of declarative chemical knowledge. Students were expected to select whether the conclusion was valid, possible or invalid and provide a short statement to explain their reasoning. Similarly, Kogut (1993) developed exercises where students were required to note observations and underlying assumptions of chemical phenomena, then develop hypotheses and experimental designs with which to test those hypotheses. However, understanding the observations and underlying assumptions was dependent on declarative chemical knowledge such as trends in the periodic table or the ideal gas law.

Garratt et al. (1999) developed an entire book dedicated to developing chemistry critical thinking, titled ‘A Question of Chemistry’. In writing this book the authors took the view that thinking critically in chemistry draws on the generic skills of critical thinking and what they call ‘an ethos of a particular scientific method’ (Garratt et al., 2000, p. 153). The approach to delivering these questions ranged from basic multiple choice questions, to rearranging statements to generate a cohesive argument, or open-ended responses. The statements preceding the questions are very discipline specific and the authors acknowledge they are inaccessible to a lay person. Overall the chemistry context is used because ‘it adds to the students’ motivation if they can see the exercises are firmly rooted in, and therefore relevant to, their chosen discipline’ (Garratt et al., 2000, p. 166).

Thus, an opportunity has been identified to develop a chemistry critical thinking test which could be used to assist chemistry educators and chemistry education researchers in evaluating the effectiveness of teaching interventions designed to develop the critical thinking skills of chemistry undergraduate students. This study aimed to determine whether a valid and reliable critical thinking test could be developed and contextualised within the discipline of chemistry, yet independent of any discipline-specific knowledge, so as to accurately reflect the critical thinking ability of chemistry students from any level of study, at any university.

This study describes the development and reliability and validity testing of an instrument with which to measure undergraduate chemistry students’ critical thinking skills: The Danczak–Overton–Thompson Chemistry Critical Thinking Test (DOT test).


The development of the DOT test occurred in five stages (Fig. 1): the development of an operational definition of critical thinking, writing the pilot DOT test, and evaluation of three iterations of the test (DOT V1, DOT V2 and DOT V3). Data obtained in Stage 1 were compared with definitions and critical thinking pedagogies described in the literature to develop a functional definition of critical thinking, which informed the development of the DOT test. During Stage 2, data from Stage 1 were used to identify which elements of critical thinking the DOT test would measure, and to identify commercially available test(s) suitable as guides for question development. Stage 3 consisted of internal reliability and content validity testing of the DOT V1. At Stage 4, a cross-sectional group of undergraduate chemistry students undertook the DOT V2 and the WGCTA-S to determine test–retest reliability, convergent validity and content validity of the DOT V2. Finally, in Stage 5, the internal reliability, item difficulty, criterion validity and discriminant validity of the DOT V3 were determined via a cross-sectional study. The test was administered to first year and third year chemistry undergraduates at Monash University, third year chemistry undergraduates at Curtin University, and a range of PhD students, post-doctoral researchers and academics drawn from Monash University and an online chemistry education research community of practice.
Fig. 1 Flow chart of methodology consisting of writing and evaluating the reliability and validity of iterations of the DOT test.
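The test–retest, convergent and criterion validity checks named in the stages above rest on standard statistics: a Pearson correlation between two sets of scores (e.g. test and retest, or DOT and WGCTA-S totals), and a between-groups comparison such as Welch's t statistic for unequal variances. The sketch below shows how such measures are commonly computed; the function names and score data are invented for illustration and are not the authors' actual analysis code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation, e.g. between test and retest total scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent
    groups (e.g. third year vs. first year total scores)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof

# Hypothetical total scores out of 30 for five participants
retest_r = pearson_r([18, 21, 25, 14, 27], [17, 22, 24, 15, 26])
t, dof = welch_t([20, 22, 24, 26, 28], [15, 17, 19, 21, 23])
```

In practice the t statistic and degrees of freedom would be referred to a t distribution (or a statistics package) to obtain the significance levels reported in studies of this kind.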

As critical thinking tests are considered to evaluate a psychometric construct (Nunnally and Bernstein, 1994) there must be supporting evidence of their reliability and validity (Kline, 2005; DeVellis, 2012). Changes made to each iteration of the DOT test and the qualitative and quantitative analysis performed at each stage of the study are described below.
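For dichotomously scored multiple choice items of the kind used in the DOT test, internal reliability is typically reported as Cronbach's alpha and item difficulty as the proportion of correct responses. As an illustration only (the scoring matrix below is invented, not the study's data), these measures can be sketched in a few lines:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Internal consistency of a test from a participant-by-item matrix
    of 0/1 scores (rows = participants, columns = items)."""
    k = len(scores[0])                        # number of items
    item_vars = sum(pvariance(col) for col in zip(*scores))
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

def item_difficulty(scores):
    """Proportion of participants answering each item correctly;
    values near 0 or 1 flag items that are too hard or too easy."""
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]

# Invented responses: five participants, four items
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
```

Here `cronbach_alpha(scores)` returns a value in the conventional 0–1 range for well-behaved data, and `item_difficulty(scores)` gives one proportion per item.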


All participants in the study were informed that their participation was voluntary and anonymous, would in no way affect their academic or professional records, and that they were free to withdraw from the study at any time. Participants were provided with an explanatory statement outlining these terms, and all procedures were approved by the Monash University Human Research Ethics Committee (MUHREC) (project number CF16/568-2016000279).

Qualitative data analysis

The major qualitative element of this study sought to examine how participants engaged with critical thinking tests. It was interested in understanding the effect of using scientific terminology in a critical thinking test, what information participants perceived as important, what they believed the questions were asking them, and the reasoning underpinning their responses to test questions. These qualitative investigations aimed to understand individuals’ perceptions of critical thinking and synthesise them into generalisable truths to improve the critical thinking test. Thus, the core research was informed by constructivism as a framework applied within the context of chemistry education, as presented in the review by Ferguson (2007). Research questions investigated using constructivism are based on the theory that ‘knowledge is constructed in the mind of the learner’ (Ferguson, 2007, p. 28). This knowledge is refined and shaped by the learner's surroundings and social interactions in what is referred to as social constructivism (Ferguson, 2007, p. 29).

The qualitative data for DOT V1 and DOT V2 were treated as separate studies. The data for these studies were collected and analysed separately as described in the following. The data collected from focus groups throughout this research were recorded with permission of the participants and transcribed verbatim into Microsoft Word, at which point participants were de-identified. The transcripts were then imported into NVivo version 11 and an initial analysis was performed to identify emergent themes. The data then underwent a second analysis to ensure any underlying themes were identified. A third review of the data used a redundancy approach to combine similar themes. The final themes were then used for subsequent coding of the transcripts (Bryman, 2008).

Developing an operational definition of critical thinking

Previous work (Danczak et al., 2017) described that chemistry students, teaching staff and employers predominately identified deductive logic elements of critical thinking, such as ‘analysis’ and ‘problem solving’, and neglected to describe inductive logic elements, such as ‘judgement’ or ‘inference’, typical of the literature on critical thinking (Facione, 1990; Halpern, 1996b).

Therefore, students, teaching staff and employers did not define critical thinking in the holistic fashion of philosophers, cognitive psychologists or education researchers. In fact, very much in line with the constructivist paradigm (Ferguson, 2007), participants seem to have drawn on elements of critical thinking relative to the environments in which they had previously been required to use critical thinking. For example, students focused on analysis and problem solving, possibly due to the assessment driven environment of university, whereas employers cited innovation and global contexts, likely to be reflective of a commercial environment.

The definitions of critical thinking in the literature (Lehman et al., 1988; Facione, 1990; Halpern, 1996b) cover a wide range of skills and behaviours. These definitions often imply that to think critically necessitates that all of these skills or behaviours be demonstrated. However, it seems almost impossible that all of these attributes could be observed at a given point in time, let alone assessed (Dressel and Mayhew, 1954; Bailin, 2002). Whilst the students in Danczak et al. (2017) used ‘problem solving’ and ‘analysis’ to define critical thinking, it does not necessarily mean that their description accurately reflects the critical thinking skills they have actually acquired, but rather their perception of what critical thinking skills they have developed. Therefore, to base a chemistry critical thinking test solely on analysis and problem solving skills would lead to the omission of the assessment of other important aspects of critical thinking.

To this end, the operational definition of critical thinking acknowledges the analysis and problem solving focus that students predominately used to describe critical thinking, whilst expanding into other important aspects of critical thinking such as inference and judgement. Consequently, guidance was sought from existing critical thinking assessments, as described below.

Development of the Danczak–Overton–Thompson chemistry critical thinking test (DOT test)

At the time of test development, only a privately licensed version of the California Critical Thinking Skills Test (CCTST) (Insight Assessment, 2013) and a practice version of the Watson-Glaser Critical Thinking Appraisal (WGCTA) (AssessmentDay Ltd, 2015) could be accessed. Several concessions had to be made with respect to the practicality of using commercial products: cost, access to questions and solutions, and reliability of assessment. WGCTA-style questions were chosen as a model for the DOT test since the practice version was freely available online with accompanying solutions and the rationale for those solutions (AssessmentDay Ltd, 2015). The WGCTA is a multiple choice test, and while open-ended response questions more accurately reflect the nature of critical thinking and provide the most reliable results, as the number of participants increases the costs in terms of time and funds become challenging and multiple choice questions become more viable (Ennis, 1993).

The WGCTA is an 85 item test which has undergone extensive scrutiny in the literature since its inception in the 1920s (Behar-Horenstein and Niu, 2011; Huber and Kuncel, 2016). It covers the core principles of critical thinking divided into five sections: inference, assumption identification, deduction, interpreting information and evaluation of arguments. The questions test each aspect of critical thinking independent of context. Each section consists of brief instructions and three or four short parent statements, and each parent statement acts as a prompt for three to seven subsequent questions. The instructions, parent statements and questions themselves are concise with respect to language and reading requirements. The WGCTA's focus on testing assumptions, deductions, inferences, analysing arguments and interpreting information is an inherent limitation in its ability to assess all critical thinking skills and behaviours. However, these elements are commonly described by many definitions of critical thinking (Facione, 1990; Halpern, 1996b).

The questions from the WGCTA practice test were analysed and used to structure the DOT test. The pilot version of the DOT test (DOT P) was initially developed with 85 questions, set within a chemistry or science context, and using similar structure and instructions to the WGCTA with five sub-scales: making assumptions, analysing arguments, developing hypotheses, testing hypotheses, and drawing conclusions.

Below is an example of a parent statement and question written for the DOT P. This question is revisited throughout this paper to illustrate the evolution of the test throughout the study. The DOT P used essentially the same instructions as provided on the WGCTA practice test; in later versions of the DOT test the instructions were changed, as will be discussed later.

The parent statement of the exemplar question from the WGCTA required the participant to recognise proposition A and proposition B are different, and explicitly states that there is a relationship between proposition A and proposition C. This format was used to generate the following parent statement:

A chemist tested a metal centred complex by placing it in a magnetic field. The complex was attracted to the magnetic field. From this result the chemist decided the complex had unpaired electrons and was therefore paramagnetic rather than diamagnetic.

In writing an assumption question for the DOT test, paramagnetic and diamagnetic behaviour of metal complexes replaced propositions A and B. The relationship between propositions A and C was replaced with paramagnetic behaviour being related to unpaired electrons. The question then asked if it is a valid or invalid assumption that proposition B is not related to proposition C.

Diamagnetic metal centred complexes do not have any unpaired electrons.

The correct answer was ‘valid assumption’, as this question required the participant to identify that propositions B and C were not related. The explanation for the correct answer was as follows:

The paragraph suggests that if the complex has unpaired electrons it is paramagnetic. This means diamagnetic complexes likely cannot have unpaired electrons.

All 85 questions on the WGCTA practice test were analysed in the manner exemplified above to develop the DOT P. In designing the test there were two requirements that had to be met. Firstly, the test needed to be short enough to be completed comfortably within 30 minutes, allowing it to be administered in short time frames, such as at the end of laboratory sessions, and increasing the likelihood of voluntary completion by students. Secondly, the test needed to accurately assess the critical thinking of chemistry students at any level of study, from first year general chemistry students to final year students. To this end, chemistry terminology was carefully chosen to ensure that prior knowledge of chemistry was not necessary to comprehend the questions. Chemical phenomena were explained and contextualised completely within the parent statement and the questions.

DOT pilot study: content validity

Members of the Monash Chemistry Education Research Group attempted the DOT P and their feedback, collected through informal discussion with the researcher, was considered an exploratory discussion of content validity. The group consisted of two teaching and research academics, one post-doctoral researcher and two PhD students. The group received the intended responses to the DOT P questions and identified which questions they felt did not elicit the intended critical thinking behaviour, citing poor wording and instances where the chemistry was poorly conveyed. Participants also expressed frustration with the five response options in the ‘Develop Hypotheses’ section, having trouble distinguishing between options such as ‘true’ or ‘probably true’ and ‘false’ or ‘probably false’.

The test took in excess of 40 minutes to complete. Therefore, questions which were identified as unclear, which did not elicit the intended responses, or which caused misconceptions of the scientific content were removed. The resulting DOT V1 contained seven questions relating to ‘Making Assumptions’, seven questions relating to ‘Analysing Arguments’, six questions relating to ‘Developing Hypotheses’, five questions relating to ‘Testing Hypotheses’ and five questions relating to ‘Drawing Conclusions’. The terms used to select a multiple choice option were written in a manner more accessible to science students, for example using terms such as ‘Valid Assumption’ or ‘Invalid Assumption’ instead of ‘Assumption Made’ or ‘Assumption Not Made’. Finally, the number of options in the ‘Developing Hypotheses’ section was reduced from five to three: ‘likely to be an accurate inference’, ‘insufficient information to determine accuracy’ and ‘unlikely to be an accurate inference’.

Data treatment of responses to the DOT test and WGCTA-S

Several iterations of DOT test (and the WGCTA-S in Stage 4) were administered to a variety of participants throughout this study. The responses to questions on these tests and test scores were used in the statistical evaluations of the DOT test. The following section outlines the data treatment and statistical approaches applied to test responses throughout this study.

All responses to the DOT test and WGCTA-S were imported into IBM SPSS Statistics (V22). Frequency tables were generated to identify erroneous or missing data. Data was considered erroneous when participants had selected ‘C’ for questions which only contained options ‘A’ or ‘B’, or when undergraduate students identified their education/occupation as that of an academic. The erroneous data was deleted and treated as missing data points. In each study a variable was created to determine the sum of unanswered questions (missing data) for each participant. Pallant (2016, pp. 58–59) suggests a judgement call is required when considering missing data and whether to treat certain cases as genuine attempts to complete the test. In the context of this study a genuine attempt was based on the number of questions a participant left unanswered: participants who attempted at least 27 questions were considered to have genuinely attempted the test.
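The genuine-attempt rule above can be sketched in a few lines. This is a hypothetical Python illustration only; the study used SPSS frequency tables, and every name and answer sheet below is invented:

```python
# Hypothetical sketch of the genuine-attempt filter: the 30-question DOT test
# counts a sheet as a genuine attempt when at least 27 questions are answered.
MIN_ANSWERED = 27

def is_genuine_attempt(responses):
    """responses: one entry per question, None where left unanswered."""
    answered = sum(1 for r in responses if r is not None)
    return answered >= MIN_ANSWERED

# Illustrative answer sheets (invented data)
sheets = [
    ["A"] * 30,                # complete attempt -> kept
    ["B"] * 20 + [None] * 10,  # only 20 answered -> excluded
]
genuine = [s for s in sheets if is_genuine_attempt(s)]
```

The same rule could equally be expressed as a computed SPSS variable summing unanswered items per participant, as the study did.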

Responses to all DOT test (and WGCTA-S) questions were coded as correct or incorrect. Upon performing descriptive statistics, the DOT V1 scores were found to exhibit a normal (Gaussian) distribution whereas the DOT V3 scores exhibited a non-normal distribution. In light of these distributions it was decided to treat all data obtained as non-parametric.

Internal reliability of each iteration of the DOT test was determined by calculating Cronbach's α (Cronbach, 1951). Within this study, comparisons between two continuous variables were made using the non-parametric equivalent of Pearson's r, the Spearman rank order test, as recommended by Pallant (2016). Continuous variables included DOT test scores and previous academic achievement as measured by tertiary entrance scores (ATAR). When comparing DOT test scores between education groups, which were treated as categorical variables, the non-parametric equivalent of the t-test, the Mann–Whitney U test, was used. When comparing a continuous variable measured for the same participants at different times, the Wilcoxon signed rank test, the non-parametric equivalent of a paired t-test, was used.
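For dichotomously scored items, Cronbach's α reduces to k/(k − 1) · (1 − Σσ²item/σ²total), where k is the number of items. A minimal pure-Python sketch of that calculation, on invented data (the study computed α in SPSS, not with this code):

```python
# Minimal sketch of Cronbach's alpha for 0/1-scored test items.
# Invented data for illustration; not the study's SPSS output.
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of 0/1 scores per question, aligned by participant."""
    k = len(items)                                    # number of items
    item_var_sum = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per person
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Three items, four participants (invented)
alpha = cronbach_alpha([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])  # ≈ 0.6
```

Higher values indicate that the item scores can more confidently be summed into a single scale score, which is the property at issue for the DOT sub-scales.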

DOT V1: internal reliability and content validity

Initial internal reliability testing of the DOT V1 was carried out with first year chemistry students in semester 1 of 2016. They were enrolled in either Chemistry I, a general chemistry course, or Advanced Chemistry I, for students who had previously studied chemistry. Content validity was evaluated through a focus group of a science education research community of practice at Monash University. This sample was self-selected, as participation in the community of practice was an opt-in activity for academics with an inherent interest in education. Furthermore, data collection was convenient and opportunistic, as the availability of some members was limited, with some unable to attend both sessions.

Internal reliability method

The DOT V1 was administered to the entire first year cohort of approximately 1200 students at the conclusion of a compulsory laboratory safety induction session during the first week of semester. Students completed the test on an optically read multiple choice answer sheet. 744 answer sheets were submitted and the data were imported into Microsoft Excel and treated according to the procedure outlined in the data treatment section above. 615 cases were used for statistical analysis after the data was filtered. As approximately half the cohort genuinely attempted the DOT V1, the data produced was likely to be representative of the overall cohort (Krejcie and Morgan, 1970). However, the data may be reflective of self-selecting participants, who may be high achieving students or inherently interested in developing their critical thinking skills. At the time, demographic data such as sex, previous chemical/science knowledge, previous academic achievement and preferred language(s) were not collected, and it is possible one or more of these discriminants may have impacted performance on the test. Demographic data was subsequently collected and analysed for later versions of the DOT. Descriptive statistics found the DOT V1 scores exhibited a normal (Gaussian) distribution. Internal consistency was then determined by calculating Cronbach's α (Cronbach, 1951).

Content validity method

The focus groups were conducted over two separate one hour meetings consisting of fifteen and nine participants respectively. Only five participants were common to both sessions. Participants were provided with the DOT V1 and asked to complete the questions from a given section. After completing a section of the DOT V1, participants were asked to discuss their responses, their reasoning for their responses, and to comment on how they might improve the questions to better elicit the intended response. The focus groups were recorded, transcribed and analysed in line with the procedures and theoretical frameworks described previously.

DOT V2: test–retest reliability, convergent validity and content validity

Several changes were made based upon analysis of the data obtained from students and academics to produce the DOT V2. Many parent statements were rewritten to include additional information in order to reduce the need to draw on knowledge external to the questions. For example, in the questions that used formal charges on anions and cations, statements were included to describe the superscripts denoting charges: ‘Carbonate (CO32−) has a formal charge of negative 2.’

The most extensive rewriting of the parent statements occurred in the ‘Analysing Arguments’ section. The feedback provided from the focus groups indicated that parent statements did not include sufficient information to adequately respond to the questions.

Additional qualifying statements were added to several questions in order to reduce ambiguity. In the parent statement of the exemplar question the first sentence was added to eliminate the need to understand that differences exist between diamagnetic and paramagnetic metal complexes, with respect to how they interact with magnetic fields:

Paramagnetic and diamagnetic metal complexes behave differently when exposed to a magnetic field. A chemist tested a metal complex by placing it in a magnetic field. From the result of the test the chemist decided the metal complex had unpaired electrons and was therefore paramagnetic.

Finally, great effort was made in the organisation of the DOT V2 to guide the test taker through a critical thinking process, similar to Halpern's approach to analysing an argument (Halpern, 1996a). Halpern teaches that an argument comprises several conclusions, and that the credibility of these conclusions must be evaluated. Furthermore, the validity of any assumptions, inferences and deductions used to construct the conclusions within an argument needs to be analysed. To this end the test taker was provided with scaffolding from making assumptions to analysing arguments in line with Halpern's approach.

Test–retest reliability and convergent validity method

A cross-sectional study of undergraduate students was used to investigate the test–retest reliability, content validity and convergent validity of the DOT V2. Participants for the study were recruited by means of advertisements in Monash chemistry facilities and on learning management system pages. The invitation was open to any Monash student currently studying a chemistry unit or who had previously completed a chemistry unit. 20 students attended the first day of the study and 18 of these students attended the second day. The initial invitation would have reached almost 2000 students. Therefore, the findings from the 18 students who participated in both days of the study were of limited generalisability.

On the first day, demographic data was collected: sex, dominant language, previous academic achievement using tertiary entrance scores (ATAR), level of chemistry being studied and highest level of chemistry study completed at Monash University. Students completed the DOT V2 using an optical reader multiple choice answer sheet. This was followed by completion of the WGCTA-S in line with procedures outlined by the Watson-Glaser critical thinking appraisal short form manual (2006). The WGCTA-S was chosen for analysis of convergent validity, as it was similar in length to the DOT V2 and was intended to measure the same aspects of critical thinking. The fact that participants completed the DOT V2 and then the WGCTA-S may have affected the participants’ performance on the WGCTA-S. This limitation will be addressed in the results.

After a brief break, the participants were divided into groups of five to eight students and interviewed about their overall impression of the WGCTA-S and their approach to various questions. Interviewers prevented the participants from discussing the DOT V2 so as not to influence each other's responses upon retesting.

On the second day participants repeated the DOT V2. DOT V2 attempts were completed on consecutive days to minimise participant attrition. Upon completion of the DOT V2 and after a short break, participants were divided into two groups of nine and interviewed about their impressions of the DOT V2, how they approached various questions and comparisons between the DOT V2 and WGCTA-S.

Responses to the tests and demographic data were imported into IBM SPSS (V22). Data was treated in accordance with the procedure outlined earlier. With the exception of tertiary entrance score, there was no missing or erroneous demographic data. Spearman's rank order correlations were performed comparing ATAR scores to scores on the WGCTA-S and the DOT V2. Test–retest reliability was determined using a Wilcoxon signed rank test (Pallant, 2016, pp. 234–236, 249–253). When the scores of the tests taken at different times show no significant difference, as determined by a p value greater than 0.05, the test can be considered to have acceptable test–retest reliability (Pallant, 2016, p. 235). Acceptable test–retest reliability does not imply that test attempts are equivalent. Rather, good test–retest reliability suggests that the precision of the test in measuring the construct of interest is acceptable. The median score of the participants’ first attempt of the DOT V2 was compared with the median score of the participants’ second attempt. To determine the convergent validity of the DOT V2, the relationship between scores on the DOT V2 and performance on the WGCTA-S was investigated using Spearman's rank order correlation.
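Spearman's rank order correlation is simply Pearson's r computed on ranks, with tied values given their average rank. A self-contained sketch of that calculation on invented scores (the study's correlations were run in SPSS; nothing below is the authors' code or data):

```python
# Minimal sketch of Spearman's rank order correlation, with average ranks for
# tied values. Invented data only; not the study's SPSS analysis.
from statistics import mean

def average_ranks(xs):
    """Rank values from 1, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                             # extend the run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for the tie run
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson's r applied to the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical ATAR scores vs DOT scores: perfectly monotonic -> rho = 1.0
rho = spearman_rho([75, 82, 90, 95], [18, 20, 23, 27])
```

Because only ranks enter the calculation, the statistic is insensitive to the non-normal score distributions that motivated the non-parametric treatment above.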

Content validity method

In each of the interviews, participants were provided with blank copies of the relevant test (the WGCTA-S on day 1 and the DOT V2 on day 2). Participants were encouraged to make any general remarks or comments with respect to the tests they had taken, with the exception of the interviews on day 1, where interviewers prevented any discussion of the DOT V2 questions. Beyond the guidelines described in this section, the interviewers did not influence the participants’ discussion, correct their reasoning or provide the correct answers to either test.

After approximately 15 minutes of participants freely discussing the relevant test, the interviewers asked the participants to look at a given section on a test, for example the ‘Testing Hypotheses’ section of the DOT V2, and identify any questions they found problematic. In the absence of students identifying any problematic questions, the interviewers used a list of questions from each test to prompt discussion. The participants were then asked as a group:

• ‘What do you think the question is asking you?’

• ‘What do you think is the important information in this question?’

• ‘Why did you give the answer(s) you did to this question?’

The interview recordings were transcribed and analysed in line with the procedures and theoretical frameworks described previously to result in four distinct themes which were used to code the transcripts.

DOT V3: internal reliability, criterion validity, content validity and discriminant validity

Detailed instructions and a cover page were added to the DOT V3. Participants in the study of the DOT V2 had drawn heavily on any worked examples in the introduction of each section, so carefully written examples were provided in the introduction of each section of the DOT V3.

Many scientific terms were either simplified or removed in the DOT V3. In the case of the exemplar question, the focus was moved to an alloy of thallium and lead rather than a ‘metal complex’. Generalising this question to focus on an alloy allowed it to retain scientific accuracy and reduced the tendency for participants to draw on knowledge outside the information presented in the questions:

Metals which are paramagnetic or diamagnetic behave differently when exposed to an induced magnetic field. A chemist tested a metallic alloy sample containing thallium and lead by placing it in an induced magnetic field. From the test results the chemist decided the metallic alloy sample repelled the induced magnetic field and therefore was diamagnetic.

This statement was then followed by the prompt asking the participant to decide if the assumption presented was valid or invalid:

Paramagnetic metals do not repel induced magnetic fields.

Several terms were rewritten as their use in science implied assumptions as identified by the student focus groups. These assumptions were not intended and hence the questions were reworded. For example, question 14 asked whether a ‘low yield’ would occur in a given synthetic route. The term ‘low yield’ was changed to ‘an insignificant amount’ to remove any assumptions regarding the term ‘yield’.

The study of the DOT V3 required participants to be drawn from several distinct groups in order to assess criterion and discriminant validity. For the purpose of criterion validity, the DOT V3 was administered to first year and third year undergraduate chemistry students, honours and PhD students and post-doctoral researchers at Monash University, and chemistry education academics from an online community of practice. Furthermore, third year undergraduate chemistry students from another Australian higher education institution (Curtin University) also completed the DOT V3 to determine discriminant validity with respect to performance on the DOT V3 outside of Monash University.


The DOT V3 was administered to undergraduate students in paper format. The test was presented in face-to-face activities such as lectures and workshops, or during laboratory safety inductions. Students responded directly onto the test. First year participants were drawn from the general chemistry unit run in semester 1 of 2017. The DOT V3 was administered to 576 of these students; 199 students attempted the test, representing approximately 19% of the first year cohort.

Third year participants were drawn from an advanced inorganic chemistry course at Monash University and a capstone chemical engineering course at Curtin University. 54 students (37%) responded to the DOT test at Monash University. The 23 students who completed the DOT V3 at Curtin University represented the entire cohort.

Post-doctoral researchers, honours and PhD students from Monash University were invited to attempt the DOT V3. 40 participants drawn from these cohorts attended a session where they completed the DOT V3 in a paper format, marking responses directly onto the test. All cohorts who completed the test in paper format required approximately 20 to 30 minutes to complete the test.

An online discussion group of approximately 300 chemistry academics with an interest in education, predominantly from Australia, the UK and Europe, was invited to complete an online version of the DOT V3. Online completion was untimed and 46 participants completed the DOT V3.

Treatment of data

All responses to the DOT V3 were imported into IBM SPSS (V22) as 385 cases. Data was treated in accordance with the procedure outlined above. 97 cases contained missing data. A further 18 cases were excluded from analysis as these participants identified as second year students, leaving 270 cases that were considered genuine attempts and used for statistical analysis. As will be discussed later, there was no statistical difference between the performance of third year students from Monash and Curtin Universities, and therefore the third year students were treated as one group. The Honours, PhD and Post-Doctoral variables were combined into the education group ‘Postgraduates’ as the individual data sets were small.

Descriptive statistics of the 270 DOT V3 results revealed that a larger proportion of scores was above the mean; thus the data were considered non-parametric for the purposes of reliability and validity statistical analysis. Internal consistency was then determined by calculating Cronbach's α (Cronbach, 1951). The five sub-scales of the DOT V3 (Making Assumptions, Developing Hypotheses, Testing Hypotheses, Drawing Conclusions and Analysing Arguments) underwent a principal component analysis to determine the number of factors affecting the DOT V3.

Criterion validity method

Several Mann–Whitney U tests were conducted to determine the criterion validity of the DOT V3. The assumption was that academics were better critical thinkers than postgraduates, that postgraduates were better critical thinkers than third year undergraduates, and that third year undergraduates were better critical thinkers than first year undergraduates. Based on this assumption, the hypothesis was that there would be a statistically significant improvement in median DOT V3 scores with experience within the tertiary education system.
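The Mann–Whitney U statistic can be computed by counting, over all cross-group pairs of scores, how often a score from one group exceeds a score from the other (ties counting one half). A minimal sketch of that counting definition on invented scores, not the SPSS implementation used in the study:

```python
# Minimal sketch of the Mann-Whitney U statistic via pairwise counting.
# Group scores below are invented for illustration.
def mann_whitney_u(a, b):
    """U_a counts pairs where a's score beats b's (ties = 0.5);
    the reported statistic is the smaller of U_a and U_b."""
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Hypothetical DOT scores for two education groups
first_years = [14, 17, 18, 20, 21]
third_years = [19, 22, 23, 24, 26]
u = mann_whitney_u(first_years, third_years)
```

A small U relative to its maximum (here len(a) × len(b) = 25) indicates little overlap between the groups, which is what the criterion validity hypothesis predicts between education levels.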

Discriminant validity method

Discriminant validity of the DOT V3 was based on whether achievement on the DOT V3 was independent of previous academic achievement and of which university the participant attended. The effect of the higher education institution was assessed using a Mann–Whitney U test comparing the median DOT V3 scores of third year chemistry students at Monash University and at Curtin University.

Results and discussion

As the DOT test was considered to be a psychometric test, validity and reliability testing were essential to ensure students’ critical thinking skills were accurately and precisely measured (Nunnally and Bernstein, 1994; Kline, 2005; DeVellis, 2012). Three separate reliability and validity studies were conducted throughout the course of test development, resulting in three iterations of the DOT test, the results of which are discussed here.

DOT V1: internal reliability and content validity

The internal consistency of the DOT V1 as determined via Cronbach's α suggested the DOT V1 had limited internal reliability (α = 0.63). Therefore, the sub-scales within the DOT V1 could not confidently be added together to measure critical thinking skill.

The academic participants in the focus groups navigated the questions on the DOT V1 to arrive at the intended responses. However, there was rarely consensus within the group and a minority, usually one or two participants, disagreed with the group. The difficulties the academics had in responding to the DOT V1 were made clear from four themes which emerged from the analysis: ‘Instruction Clarity’, ‘Wording of the Question(s)’, ‘Information within the Statement’ and ‘Prior Knowledge’ (Table 2). The last theme was generally found to be associated with the other themes.

Table 2 Themes identified in the qualitative analysis of the academic focus groups for DOT V1
Theme (theme representation %) Description Example
Instruction clarity (15%) Insufficient information within the instructions (sometimes due to superficial reading) …what do you mean by ‘is of significant importance’?
Wording of the question (41%) Attributing meaning to specific words within a question …the language gives it away cause you say ‘rather than’.
Information within the parent statement (15%) Adequate or insufficient information in the parent statement to respond to the questions Basically what you’re providing in the preamble is the definition, right?
Prior knowledge (30%) Using or requiring prior scientific knowledge from outside the test I didn’t think there was any information about the negative charge… I didn’t know what that meant, so I tried to go on the text.

The theme of ‘Instruction Clarity’ was used to describe when participants either had difficulty interpreting the instructions or intentionally ignored the instructions. Several participants self-reported their tendency to only scan the instructions without properly reading them, or did not read the statement preceding a question in its entirety. When this behaviour occurred, academics were quick to draw on outside knowledge. This theme identified the need for clarity of the instructions and providing relevant examples of what was meant by terms such as ‘is of significant importance’ or ‘is not of significant importance’.

The theme of ‘Wording of the Questions’ referred to evaluating the meaning and use of particular words within the questions or the parent statements. The wording of several questions led to confusion, causing the participants to draw on outside knowledge. Unclear terminology prevented non-science participants (education developers) from attempting the questions, a problem further compounded by the use of terms such as ‘can only ever be’. For example, the use of the term ‘rather than’ confused participants when they knew a question had more than two alternatives.

The theme ‘Information within the Statement’ referred to the participants’ perceptions of the quality and depth of information provided in the parent statements. Participants suggested some test questions appeared to be non-sequiturs with respect to the corresponding parent statements. Participants felt they did not have enough information to make a decision, and the lack of clarity in the instructions further compounded the problem.

The theme ‘Prior Knowledge’ identified instances when participants had drawn on information not provided in the DOT V1 to answer the questions. Several issues regarding prior knowledge emerged from the discourse. Participants identified that some assumptions were made about the use of chemical notations. Finally, some participants highlighted that having prior knowledge, specifically in science and/or chemistry, was to their detriment when attempting the questions.

DOT V2: test–retest reliability, convergent and content validity

The study of the DOT V2 investigated its test–retest reliability, convergent validity and content validity compared with the WGCTA-S. The participants for this cross-sectional study comprised equal numbers of male and female students; 17 participants identified English as their preferred language (85%) and three identified English as their second language (15%). Their ages ranged from 18 to 21 with a median age of 19. Six students were undertaking first year chemistry courses (30%), five were taking second year courses (25%), seven were taking third year courses (35%), one was taking fourth year (honours research) (5%), and one was not currently studying any chemistry (5%).

A total of 15 participants provided their tertiary entrance score (ATAR) as a measure of previous academic achievement. There is some discussion in the literature suggesting that university entrance scores obtained in high school do not reflect intelligence and cognitive ability (Richardson, Abraham and Bond, 2012). However, a comparison of previous academic achievement, reported via ATAR scores, revealed a small positive correlation with scores obtained on the DOT V2 (ρ = 0.23) and a moderate positive correlation with scores obtained on the WGCTA-S (ρ = 0.47).

Test–retest reliability

18 participants took part in test–retesting of the DOT V2. A Wilcoxon signed rank test revealed no statistically significant change in score of the DOT V2 due to test–retesting, with a very small effect size (z = −0.11, p = 0.91, r = 0.03). The median score on the first day of testing (22.0) was similar to the median score on the second day of testing (22.5), suggesting good test–retest reliability. The main concern regarding these findings was that the two attempts of the DOT V2 were made on consecutive days. This was done in an attempt to reduce participant attrition but risked participants responding exactly as they did in their first attempt of the DOT V2 from memory. In fact, students identified that they felt they were remembering the answers from their previous attempt.
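The effect size quoted above is consistent with the common convention r = |z|/√n, with n taken as the number of paired cases; note that some texts instead use the total number of observations, so the exact convention here is an assumption:

```python
# Effect size r for a Wilcoxon signed-rank z statistic, assuming the
# r = |z| / sqrt(n) convention with n = number of paired cases.
from math import sqrt

def wilcoxon_effect_size(z, n):
    return abs(z) / sqrt(n)

r = wilcoxon_effect_size(-0.11, 18)  # reported z and pair count; r ≈ 0.03
```

With 18 paired attempts and z = −0.11 this reproduces the very small r = 0.03 reported for the test–retest comparison.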

The second time it felt like I was just remembering what I put down the day before.

The WGCTA-S manual (Watson and Glaser, 2006, pp. 30–31) listed three studies with test–retest intervals of three months, two weeks or four days. Each of these studies reported test–retest correlations ranging from r = 0.73 to 0.89, with larger correlations associated with shorter intervals between testing. However, as the p value of the Wilcoxon signed rank test was sufficiently large (0.91), it was unlikely that the DOT V2 would have exhibited poor test–retest reliability were it to be administered over a longer time interval.

Convergent validity

Analysis of convergent validity was conducted using the WGCTA-S and the first-day attempts of the DOT V2, as there was no statistically significant difference between the scores of the two attempts of the DOT V2. The relationship between performance on the DOT V2 and performance on the WGCTA-S was investigated using Spearman's rank order correlation, revealing a small positive correlation between the two variables (ρ = 0.31, n = 18, p = 0.21). The WGCTA users guide (Watson and Glaser, 2006) suggests the correlation with other reasoning tests should reflect the degree of similarity between the tests. In fact, Pearson reports a range of correlations from 0.48 to 0.70 when comparing the WGCTA with other reasoning tests (Watson and Glaser, 2006, pp. 41–42). The small correlation between the DOT V2 and the WGCTA-S suggested that the DOT V2 was not necessarily measuring the same aspects of critical thinking as the WGCTA-S, as was initially presumed. The modest positive correlation may have been due to the small number (n = 20) of self-selected participants, but did suggest the DOT V2 exhibited some degree of convergent validity. A larger number of participants may have provided more convincing data. For example, in a study of a physics critical thinking test based on the HCTA, as few as 45 participants were sufficient to obtain statistically significant data (Tiruneh et al., 2016).

Content validity

The group interviews provided evidence that participants recognised the importance of the context of the tests, and that they felt more comfortable doing the DOT V2 as they attached greater significance to the chemistry context.

I found the questions (on the DOT V2) a bit more interesting and engaging in general whereas this one (WGCTA-S) seemed a bit more clinical.

However, two participants did express a preference for the WGCTA-S, citing the detailed examples in the instructions of each section, and their frustration when attempting the DOT V2, which required them to recognise whether they were drawing on chemistry knowledge outside of the question.

The qualitative analysis of the student focus group data provided useful insight regarding the content validity of the DOT V2. When discussing their responses, the participants often arrived at a group consensus on the correct answers for both the DOT V2 and the WGCTA-S. Rarely did the participants initially arrive at a unanimous decision. In several instances on both tests, there were as many participants in favour of the incorrect response as there were participants in favour of the correct response. Four themes emerged from the analysis of the transcripts which are presented in Table 3.

Table 3 Themes identified in the qualitative analysis of the student focus groups for the DOT V2 and the WGCTA-S
Theme (theme representation %) Description Example
Strategies used to attempt test questions (46%) Approaches participants took including dependence on examples, evaluating key words, construction of rules or hypothetical scenarios …you could come back to it and then look at how each of the example questions were answered…
Difficulties associated with prior knowledge (21%) Participants were consciously aware of their prior knowledge, either attempting to restrict its use or finding it in conflict with their response to a given question It's quite difficult to leave previous knowledge and experience off when you’re trying to approach these (questions).
Terms used to articulate cognitive processes (22%) Evidence of critical thinking and critical thinking terminology the participants were exposed to throughout the focus groups, in particular ‘bias’ I think like the first section…was more difficult than the other because I think I had more bias in that question.
Evidence of peer learning (11%) Discourse between participants in which new insight was gained regarding how to approach test questions To me, the fact that you know it starts talking about…fall outside of the information and is therefore an invalid assumption.

The theme ‘Strategies used to attempt test questions’ describes both the participants’ overall practice and increasing familiarity with the style of questions, and the specific cognitive techniques used in attempting to answer questions. As participants became more familiar with the style of questions, their dependence on the examples provided diminished.

Some participants had difficulty understanding what was meant by ‘Assumption Made’ and ‘Assumption Not Made’ in the ‘Recognition of Assumption’ section in the WGCTA-S and drew heavily on the worked examples provided in the introduction to the section. At the conclusion of this study, these participants had completed three critical thinking tests and were becoming familiar with how the questions were asked and what was considered a correct response. However, test–retesting with the DOT V2 indicated that there was no change in performance.

There was concern that providing detailed instructions on the DOT test may in fact develop the participants’ critical thinking skills in the process of attempting to measure them. For example, a study conducted with 152 undergraduate economics students divided into six approximately equal groups (Heijltjes et al., 2015) found that participants who were exposed to written instructions performed on average 50% better on the critical thinking skills test than those who did not receive written instructions. It seems plausible that a similar result would occur with the DOT test, and evaluating the impact of instructions and examples using control and test groups would be beneficial in future studies of the DOT test.

The second aspect of this theme was the application of problem solving skills and the generation of hypothetical scenarios to which deductive logic could be applied. The following quote is an example of a participant explicitly categorising the information provided in the parent statements and systematically analysing the relationships between those categories to answer the questions.

I find that with (section) three, deduction, that it was really easy to think in terms of sets, it was easier to think in terms set rather than words, doodling Venn diagrams trying to solve these ones.

The Delphi report considers behaviours described by this theme to be part of the interpretation critical thinking skill which describes the ability ‘to detect … relationships’ or ‘to paraphrase or make explicit … conventional or intended meaning’ of a variety of stimuli (Facione, 1990, p. 8). Others consider this behaviour to be more reflective of problem solving skills, describing the behaviour as ‘understanding of the information given’ in order to build a mental representation of the problem (OECD, 2014, p. 31). These patterns of problem solving were most evident in the discussions in response to the WGCTA-S questions.

With respect to the DOT V2, participants had difficulty understanding the intended meaning of the questions without drawing on their prior knowledge of chemistry and science. For example, there were unexpected discussions of the implications of the term ‘lower yield’ in a chemistry context and its relationship to a reaction failing. Participants pointed out underlying assumptions associated with terms such as ‘yield’, highlighting that ‘yield’ does not necessarily imply that a desired product was obtained.

The theme ‘Difficulties associated with prior knowledge’ described instances in which participants drew on knowledge from outside the test in efforts to respond to the questions. In both the WGCTA-S and the DOT V2, the instructions clearly stated that only the information provided within the parent statements and the questions should be used. These difficulties were most prevalent when participants described their experiences with the DOT V2. For example, the participants were asked to determine the validity of a statement regarding the relationship between the formal charge of anions and how readily anions accept hydrogen ions. In arriving at their answer, one student drew on their outside knowledge of large molecules such as proteins to suggest:

What if you had some ridiculous molecule that has like a 3 minus charge but the negative zones are all the way inside the molecule then it would actually accept the H plus?

While this student's hypothesis led them to decide that the assumption was invalid, which was the intended response, the intended approach of this question was to recognise that the parent statement made no reference to how strongly cations and anions are attracted to each other.

It was concerning that some participants felt they had to ‘un-train’ themselves of their chemical knowledge in order to properly engage with the DOT V2. Some participants highlighted that they found the WGCTA-S easier because they did not have to reflect on whether they were using their prior knowledge. However, many participants were asking themselves ‘why am I thinking what I’m thinking?’, which is indicative of the higher-order metacognitive skills described by several critical thinking theoreticians (Facione, 1990, p. 10; Kuhn, 2000; Tsai, 2001). Students appeared to question whether their responses to the DOT V2 were based on their own pre-existing knowledge or on the information presented within the test, as highlighted in the following statement.

You had to think more oh am I using my own knowledge or what's just in the question? I was like so what is assumed to be background knowledge. What's background knowledge?

The theme ‘Terms used to articulate cognitive processes’ described the participants applying the language from the instructions of the WGCTA-S and the DOT V2 to articulate their thought processes. In particular, participants were very aware of their prior knowledge, referring to this as ‘bias’.

In response to the questions in the ‘Developing Hypothesis’ section, which related to the probability of failure of an esterification reaction, one student identified that they attempted to view the questions from the perspective of individuals with limited scientific knowledge in order to minimise the influence of their prior chemical knowledge on their responses. There was much discussion of what was meant by the term ‘failure’ in the context of a chemical reaction, and whether failure referred to unsuccessful collisions at the molecular level or to the absence of a product at the macroscopic level.

The students engaged in dialogue which helped them refine the language they used in articulating their thoughts or helped them recognise thinking errors. This constitutes the final emergent theme, ‘Evidence of peer learning’. For example, when discussing their thought processes regarding a question in the ‘Deduction’ section of the WGCTA-S, one student shared their strategy of constructing mental Venn diagrams and correctly identified how elements of the question related. This prompted other students to recognise the connection they had initially failed to make and to reconsider their responses.

DOT V3: internal reliability, criterion and discriminant validity

Table 4 summarises the demographic data according to education group. The distribution of sex was representative of the first year, third year undergraduate and postgraduate cohorts. The academics education group contained slightly more males (58%), and the median age (50) suggests the majority of academics were in mid to late career. The mean ATAR score was reflective of the high admission standards set by the two universities. Fewer postgraduates and academics provided an ATAR score, as many of these participants may not have completed their secondary education in Australia, or may have completed it before the ATAR was introduced in 2009. 271 (91.6%) participants identified English as their preferred language.
Table 4 Demographic data of participants for the DOT V3
Education group | Mean ATAR | Female | Male
First year | 87.10 (n = 104) | 54% (n = 64) | 43% (n = 51)
Third year | 90.03 (n = 55) | 39% (n = 26) | 61% (n = 41)
Postgraduates | 87.35 (n = 19) | 43% (n = 19) | 57% (n = 25)
Academics | 89.58 (n = 3) | 38% (n = 15) | 58% (n = 23)
Overall | 88.23 (n = 181) | 46% (n = 124) | 52% (n = 140)

Internal reliability

The internal consistency of the DOT V3 using Cronbach's α was found to be acceptable (α = 0.71), which suggested the DOT V3 exhibited acceptable internal reliability (DeVellis, 2012, p. 109) and that the sub-scales could confidently be added together to measure critical thinking skill. To generate further evidence of internal reliability, the sub-scales of the DOT V3 were examined and found suitable for factor analysis. The analysis revealed all correlations to be greater than 0.3, with the exception of the correlation between the ‘Developing Hypotheses’ and ‘Drawing Conclusions’ sections (0.26). Principal factor analysis revealed the DOT V3 was unidimensional, with one factor explaining 50.31% of total variance and all sub-scales correlating with one factor (0.79–0.59), in this case likely the overall score on the DOT V3. When factor analysis was conducted on the sub-scales of the WGCTA, it was also found to be unidimensional (Hassan and Madhum, 2007).
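For readers wishing to perform the same internal-consistency check on their own cohorts, the calculation underlying a coefficient such as the reported α = 0.71 can be sketched in a few lines of Python. The score matrix below is purely illustrative and is not the study's data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical sub-scale scores for five respondents across five sections
scores = np.array([
    [4, 5, 3, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 4, 4, 5, 4],
    [1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3],
])
alpha = cronbach_alpha(scores)
```

A value near 0.7 or above is conventionally read as acceptable internal reliability (DeVellis, 2012), although the threshold is a rule of thumb rather than a strict cut-off.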

Criterion validity

Table 5 shows that significant differences in median scores were found between all education groups, with the exception of postgraduates compared to academics. Of particular interest was that medium (r > 0.3) and large (r > 0.5) effect sizes were obtained when comparing the median scores of first year students with those of third year students, postgraduates and academics. These findings provide strong evidence that the DOT V3 possesses good criterion validity for measuring the critical thinking skills of chemistry students up to and including postgraduate level.
Table 5 Mann–Whitney U tests comparing the median scores obtained on the DOT V3 of each education group
Education group | vs 1st year | vs 3rd year | vs P’grad | vs Academic
1st year (n = 119, Md = 16) | – | p < 0.001, r = 0.39 | p < 0.001, r = 0.64 | p < 0.001, r = 0.59
3rd year (n = 67, Md = 21) | p < 0.001, r = 0.39 | – | p < 0.001, r = 0.30 | p = 0.003, r = 0.28
Postgraduates (n = 44, Md = 23.5) | p < 0.001, r = 0.53 | p < 0.001, r = 0.30 | – | p = 0.691, r = 0.04
Academic (n = 40, Md = 24) | p < 0.001, r = 0.59 | p = 0.003, r = 0.28 | p = 0.691, r = 0.04 | –
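The per-comparison effect sizes reported alongside each Mann–Whitney U test follow the common convention r = |z|/√N, where z comes from the normal approximation to U. A minimal stdlib-Python sketch is given below; the data are illustrative, not the study's, and no tie or continuity correction is applied:

```python
import math

def mann_whitney_r(x, y):
    """Mann-Whitney U with two-sided p and effect size r = |z|/sqrt(N).

    z is taken from the normal approximation to U, without tie or
    continuity corrections, so small samples are only approximate.
    """
    n1, n2 = len(x), len(y)
    # U1 = number of (x_i, y_j) pairs with x_i > y_j; ties count as 1/2
    u1 = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mu = n1 * n2 / 2                                   # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # sd of U under H0
    z = (u1 - mu) / sigma
    r = abs(z) / math.sqrt(n1 + n2)                    # effect size
    p = math.erfc(abs(z) / math.sqrt(2))               # two-sided p
    return u1, p, r
```

Under this convention r values of roughly 0.3 and 0.5 mark the medium and large effect sizes referred to in the text.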

Interestingly, there appeared to be no statistically significant difference in DOT V3 scores when comparing postgraduates and academics. If the assumption that critical thinking skill correlates positively with time spent in tertiary education is valid, it is likely that the DOT V3 was not sensitive enough to detect a difference in critical thinking skill between postgraduates and academics.

Discriminant validity

Several Mann–Whitney U tests were conducted to determine whether sex was a significant predictor of performance on the DOT V3. These tests were conducted on the cohort as a whole and at the education group level (Table 6). The overall comparison of median scores would suggest that sex is a discriminator affecting performance on the DOT V3, favouring males. When viewed at the education group level, a statistically significant difference in performance was only observed at the first year level. Typically, researchers find no relationship between sex and scores on other critical thinking tests (Hassan and Madhum, 2007; Butler, 2012). However, sex differences in performance on aptitude tests and argumentative writing tests are not unheard of in the literature (Halpern et al., 2007; Preiss et al., 2013). Beyond first year, sex appears not to be a discriminator of DOT V3 score. Further evaluation of the test will be conducted on a larger sample of first year students to see whether the difference between sexes persists.
Table 6 Mann–Whitney U tests comparing the median score of the DOT V3 as determined by sex
Education group | Female | Male | Significance
1st year | n = 64, Md = 15 | n = 51, Md = 18 | p = 0.007, r = 0.25
3rd year | n = 26, Md = 20.5 | n = 41, Md = 21 | p = 0.228, r = 0.15
Postgraduate | n = 19, Md = 24 | n = 25, Md = 23 | p = 0.896, r = 0.02
Academic | n = 15, Md = 24 | n = 23, Md = 22 | p = 0.904, r = 0.02

Using Spearman's rank-order correlation coefficient, a weak positive correlation was found between DOT V3 score and ATAR score (ρ = 0.20, n = 194, p = 0.01), suggesting previous academic achievement was only a minor predictor of DOT V3 score. This is in line with observations collected during testing of the DOT V2, where a small correlation (ρ = 0.23) was found between previous academic achievement and performance on the test, although from a small sample (n = 15). The much larger sample in this study (n = 194) strengthens the conclusion that performance on the DOT V3 is only slightly correlated with previous academic achievement.
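Spearman's ρ is simply the Pearson correlation applied to rank-transformed data, with tied values sharing their average rank. A self-contained sketch, using hypothetical paired scores rather than the study's data, is:

```python
import math

def _ranks(v):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, plus 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only ranks enter the calculation, ρ is insensitive to the differing scales of a test score (out of 30) and an ATAR (out of 100), which is why it suits this comparison.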

In order to determine the validity of the DOT V3 outside of Monash University, the median scores of third year students from Monash University and Curtin University were compared using a Mann–Whitney U test. The test revealed no significant difference between the scores obtained by Monash University students (Md = 20, n = 44) and Curtin University students (Md = 22, n = 24), U = 670.500, z = 1.835, p = 0.07, r = 0.22. Therefore, the score obtained on the DOT V3 was considered independent of where the participant attended university. It is possible that an insufficient number of tests were completed, owing to the opportunistic sampling from both universities; obtaining equivalent sample sizes across several higher education institutions would confirm whether the DOT V3 performs consistently across institutions.


Conclusions

A chemistry critical thinking test (DOT test), which aimed to evaluate the critical thinking skills of undergraduate chemistry students at any year level irrespective of the students’ prior chemistry knowledge, was developed. Test development included a pilot study (DOT P) and three iterations of the test (DOT V1, DOT V2 and DOT V3). Reliability and validity studies were conducted on each iteration throughout development. The findings from these studies demonstrate that the most recent version of the test, DOT V3, has good internal reliability. Determination of discriminant validity showed that the DOT V3 score was independent of previous academic achievement (as measured by ATAR score). Additionally, performance on the DOT V3 was found to be independent of whether the participant attended Monash or Curtin University. The test was found to exhibit strong criterion validity through comparison of the median scores of first year undergraduate, third year undergraduate and postgraduate chemistry students, and academics from an online community of practice. However, the DOT V3 was unable to distinguish between postgraduate and academic participants. Additionally, qualitative studies conducted throughout test development highlighted the inherent difficulty of assessing critical thinking independent of context, as participants expressed difficulty restricting the use of their prior chemical knowledge when responding to the test. This qualitative finding lends evidence to the specifist perspective that critical thinking cannot truly be independent of context.

The DOT V3 offers a tool with which to measure students’ critical thinking skills and the effect of any teaching interventions specifically targeting the development of critical thinking. The test is suitable for studying the development of critical thinking using a cross section of students, and may be useful in longitudinal studies of a single cohort. In summary, this study provides a body of evidence regarding the reliability and validity of the DOT test, and it offers the chemistry education community a valuable research and educational tool for measuring the development of undergraduate chemistry students’ critical thinking skills. The DOT V3 is included in Appendix 1 (ESI) (administrator guidelines and intended responses to the DOT V3 can be obtained by email correspondence with the first author).

Implications for practice

Using the DOT V3, it may be possible to evaluate the development of critical thinking across a degree program, as much of the literature has done with commercially available tests (Carter et al., 2015). It may also be possible to obtain baseline data on students’ critical thinking skills and use these data to inform teaching practices aimed at developing those skills in subsequent years.

Whilst there is the potential to measure the development of critical thinking over a semester using the DOT V3, there is evidence to suggest that a psychological construct such as critical thinking does not develop enough for measurable differences to occur in the space of a single semester (Pascarella, 1999). While the DOT V3 could be administered to the same cohort of students annually to form the basis of a longitudinal study, such a study faces many hurdles, including participant retention and participants’ developing familiarity with the test. Much like the CCTST and the WGCTA pre- and post-testing (Jacobs, 1999; Carter et al., 2015), at least two versions of the DOT V3 may be required for pre- and post-testing and for longitudinal studies. However, having a larger pool of questions does not prevent participants from becoming familiar with the style of critical thinking questions, and development of an additional test would require further reliability and validity testing (Nunnally and Bernstein, 1994). Nonetheless, cross-sectional studies are useful in identifying changes in critical thinking skills, and the DOT V3 has demonstrated it is sensitive enough to discern between the critical thinking skills of first year and third year undergraduate chemistry students.

Conflicts of interest

There are no conflicts to declare.


Acknowledgements

The authors would like to acknowledge the participants from Monash University and Curtin University, and the academics from the community of practice, who took the time to complete the various versions of the DOT test and/or participate in the focus groups. This research was made possible through Australian Postgraduate Award funding and with the guidance of the Monash University Human Ethics Research Committee.


  1. Abrami P. C., Bernard R. M., Borokhovski E., Wade A., Surkes M. A., Tamim R. and Zhang D., (2008), Instructional interventions affecting critical thinking skills and dispositions: a stage 1 meta-analysis, Rev. Educ. Res., 78(4), 1102–1134.
  2. Abrami P. C., Bernard R. M., Borokhovski E., Waddington D. I., Wade C. A. and Persson T., (2015), Strategies for teaching students to think critically: a meta-analysis, Rev. Educ. Res., 85(2), 275–314.
  3. AssessmentDay Ltd, (2015), Watson Glaser critical thinking appraisal, retrieved from, accessed on 03/07/2015.
  4. Bailin S., (2002), Critical thinking and science education, Sci. Educ., 11, 361–375.
  5. Barnett R., (1997), Higher education: a critical business, Buckingham: Open University Press.
  6. Behar-Horenstein L. S. and Niu L., (2011), Teaching critical thinking skills in higher education: a review of the literature, J. Coll. Teach. Learn., 8(2), 25–41.
  7. Bryman A., (2008), Social research methods, 3rd edn, Oxford: Oxford University Press.
  8. Butler H. A., (2012), Halpern critical thinking assessment predicts real-world outcomes of critical thinking, Appl. Cognit. Psychol., 26(5), 721–729.
  9. Carter A. G., Creedy D. K. and Sidebotham M., (2015), Evaluation of tools used to measure critical thinking development in nursing and midwifery undergraduate students: a systematic review, Nurse Educ. Today, 35(7), 864–874.
  10. Cronbach L. J., (1951), Coefficient alpha and the internal structure of tests, Psychometrika, 16, 297–334.
  11. Danczak S. M., Thompson C. D. and Overton T. L., (2017), What does the term critical thinking mean to you? A qualitative analysis of chemistry undergraduate, teaching staff and employers' views of critical thinking, Chem. Educ. Res. Pract., 18(3), 420–434.
  12. Davies M., (2013), Critical thinking and the disciplines reconsidered, High. Educ. Res. Dev., 32(4), 529–544.
  13. Desai M. S., Berger B. D. and Higgs R., (2016), Critical thinking skills for business school graduates as demanded by employers: a strategic perspective and recommendations, Acad. Educ. Leadership J., 20(1), 10–31.
  14. DeVellis, R. F., (2012), Scale development: Theory and applications, 3rd edn, Thousand Oaks, CA: Sage.
  15. Dressel P. L. and Mayhew L. B., (1954), General education: explorations in evaluation, Washington, DC: American Council on Education.
  16. Edwards D., Perkins K., Pearce J. and Hong J., (2015), Work integrated learning in STEM in Australian universities, retrieved from, accessed on 05/12/2016.
  17. Ennis R. H., (1989), Critical thinking and subject specificity: clarification and needed research, Educ. Res., 18(3), 4–10.
  18. Ennis R. H., (1990), The extent to which critical thinking is subject-specific: further clarification, Educ. Res., 19(4), 13–16.
  19. Ennis R. H., (1993), Critical thinking assessment, Theory Into Practice, 32(3), 179–186.
  20. Ennis R. H. and Weir E., (1985), The Ennis-Weir critical thinking essay test: test, manual, criteria, scoring sheet, retrieved from, accessed on 09/10/2017.
  21. Facione P. A., (1990), Critical thinking: a statement of expert consensus for purposes of educational assessment and instruction. Executive summary. “The Delphi report”, Millbrae, CA: T. C. A. Press.
  22. Ferguson R. L., (2007), Constructivism and social constructivism, in Bodner G. M. and Orgill M. (ed.), Theoretical frameworks for research in chemistry and science education, Upper Saddle River, NJ: Pearson Education (US).
  23. Flynn A. B., (2011), Developing problem-solving skills through retrosynthetic analysis and clickers in organic chemistry, J. Chem. Educ., 88, 1496–1500.
  24. Garratt J., Overton T. and Threlfall T., (1999), A question of chemistry, Essex, England: Pearson Education Limited.
  25. Garratt J., Overton T., Tomlinson J. and Clow D., (2000), Critical thinking exercises for chemists, Active Learn. High. Educ., 1(2), 152–167.
  26. Glaser R., (1984), Education and thinking: the role of knowledge, Am. Psychol., 39(2), 93–104.
  27. Gupta T., Burke K. A., Mehta A. and Greenbowe T. J., (2015), Impact of guided-inquiry-based instruction with a writing and reflection emphasis on chemistry students’ critical thinking abilities, J. Chem. Educ., 92(1), 32–38.
  28. Halpern D. F., (1993), Assessing the effectiveness of critical thinking instruction, J. General Educ., 50(4), 238–254.
  29. Halpern D. F., (1996a), Analyzing arguments, in Halpern D. F. (ed.), Thought and knowledge: an introduction to critical thinking, 3rd edn, Mahwah, NJ: L. Erlbaum Associates, pp. 167–211.
  30. Halpern D. F., (1996b), Thought and knowledge: An introduction to critical thinking, 3rd edn, Mahwah, NJ: L. Erlbaum Associates.
  31. Halpern D. F., (1998), Teaching critical thinking for transfer across domains. Dispositions, skills, structure training, and metacognitive monitoring, Am. Psychol., 53, 449–455.
  32. Halpern D. F., (2016), Manual: Halpern critical thinking assessment, retrieved from, accessed on 09/10/2017.
  33. Halpern D. F., Benbow C. P., Geary D. C., Gur R. C., Hyde J. S. and Gernsbacher M. A., (2007), The science of sex differences in science and mathematics, Psychol. Sci. Public Interest, 8(1), 1–51.
  34. Hassan K. E. and Madhum G., (2007), Validating the Watson Glaser critical thinking appraisal, High. Educ., 54(3), 361–383.
  35. Heijltjes A., van Gog T., Leppink J. and Paas F., (2015), Unraveling the effects of critical thinking instructions, practice, and self-explanation on students' reasoning performance, Instr. Sci., 43, 487–506.
  36. Henderson D. E., (2010), A chemical instrumentation game for teaching critical thinking and information literacy in instrumental analysis courses, J. Chem. Educ., 87, 412–415.
  37. Huber C. R. and Kuncel N. R., (2016), Does college teach critical thinking? A meta-analysis, Rev. Educ. Res., 86(2), 431–468.
  38. Inhelder B. and Piaget J., (1958), The growth of logical thinking from childhood to adolescence: an essay on the construction of formal operational structures, London: Routledge & Kegan Paul.
  39. Insight Assessment, (2013), California critical thinking skills test (CCTST), Request information, retrieved from, accessed on 07/09/2017.
  40. Iwaoka W. T., Li Y. and Rhee W. Y., (2010), Measuring gains in critical thinking in food science and human nutrition courses: the Cornell critical thinking test, problem-based learning activities, and student journal entries, J. Food Sci. Educ., 9(3), 68–75.
  41. Jackson D., (2010), An international profile of industry-relevant competencies and skill gaps in modern graduates, Int. J. Manage. Educ., 8(3), 29–58.
  42. Jacob C., (2004), Critical thinking in the chemistry classroom and beyond, J. Chem. Educ., 81(8), 1216–1223.
  43. Jacobs S. S., (1999), The equivalence of forms a and b of the California critical thinking skills test, Meas. Eval. Counsel. Dev., 31(4), 211–222.
  44. Johnson R. H., Blair J. A. and Hoaglund J., (1996), The rise of informal logic: essays on argumentation, critical thinking, reasoning, and politics, Newport, VA: Vale Press.
  45. Klein G. C. and Carney J. M., (2014), Comprehensive approach to the development of communication and critical thinking: bookend courses for third- and fourth-year chemistry majors, J. Chem. Educ., 91, 1649–1654.
  46. Kline T., (2005), Psychological testing: A practical approach to design and evaluation, Thousand Oaks, CA: Sage Publications.
  47. Kogut L. S., (1993), Critical thinking in general chemistry, J. Chem. Educ., 73(3), 218–221.
  48. Krejcie R. V. and Morgan D. W., (1970), Determining sample size for research activities, Educ. Psychol. Meas., 30(3), 607–610.
  49. Kuhn D., (1999), A developmental model of critical thinking, Educ. Res., 28(2), 16–26.
  50. Kuhn D., (2000), Metacognitive development, Curr. Dir. Psychol. Sci., 9(5), 178–181.
  51. Lehman D. R. and Nisbett R. E., (1990), A longitudinal study of the effects of undergraduate training on reasoning, Dev. Psychol., 26, 952–960.
  52. Lehman D. R., Lempert R. O. and Nisbett R. E., (1988), The effects of graduate training on reasoning: formal discipline and thinking about everyday-life events, Am. Psychol., 43, 431–442.
  53. Lindsay E., (2015), Graduate outlook 2014 employers' perspectives on graduate recruitment in Australia, Melbourne: Graduate Careers Australia, retrieved from, accessed on 21/08/2015.
  54. Lowden K., Hall S., Elliot D. and Lewin J., (2011), Employers’ perceptions of the employability skills of new graduates: research commissioned by the edge foundation, retrieved from, accessed on 06/12/2016.
  55. Martineau E. and Boisvert L., (2011), Using wikipedia to develop students' critical analysis skills in the undergraduate chemistry curriculum, J. Chem. Educ., 88, 769–771.
  56. McMillan J., (1987), Enhancing college students' critical thinking: a review of studies, J. Assoc. Inst. Res., 26(1), 3–29.
  57. McPeck J. E., (1981), Critical thinking and education, Oxford: Martin Robertson.
  58. McPeck J. E., (1990), Teaching critical thinking: dialogue and dialectic, New York: Routledge.
  59. Monash University, (2015), Undergraduate – area of study. Chemistry, retrieved from, accessed on 15/04/2015.
  60. Moore T. J., (2011), Critical thinking and disciplinary thinking: a continuing debate, High. Educ. Res. Dev., 30(3), 261–274.
  61. Moore T. J., (2013), Critical thinking: seven definitions in search of a concept, Stud. High. Educ., 38(4), 506–522.
  62. Nisbett R. E., Fong G. T., Lehman D. R. and Cheng P. W., (1987), Teaching reasoning, Science, 238, 625–631.
  63. Nunnally J. C. and Bernstein I. H., (1994), Psychometric theory, New York: McGraw-Hill.
  64. OECD, (2014), PISA 2012 results: creative problem solving: students' skills in tackling real-life problems (Volume V), OECD Publishing, retrieved from, accessed on 05/01/2018.
  65. Oliver-Hoyo M. T., (2003), Designing a written assignment to promote the use of critical thinking skills in an introductory chemistry course, J. Chem. Educ., 80, 899–903.
  66. Ontario University, (2017), Appendix 1: OCAV's undergraduate and graduate degree level expectations, retrieved from, accessed on 09/10/2017.
  67. Pallant J. F., (2016), SPSS survival manual, 6th edn, Sydney: Allen & Unwin.
  68. Pascarella E., (1999), The development of critical thinking: Does college make a difference? J. Coll. Stud. Dev., 40(5), 562–569.
  69. Pearson, (2015), Watson-Glaser critical thinking appraisal – short form (WGCTA-S), retrieved from, accessed on 03/07/2015.
  70. Phillips V. and Bond C., (2004), Undergraduates' experiences of critical thinking, High. Educ. Res. Dev., 23(3), 277–294.
  71. Pithers R. T. and Soden R., (2000), Critical thinking in education: a review, Educ. Res., 42(3), 237–249.
  72. Preiss D. D., Castillo J., Flotts P. and San Martin E., (2013), Assessment of argumentative writing and critical thinking in higher education: educational correlates and gender differences, Learn. Individ. Differ., 28, 193–203.
  73. Prinsley R. and Baranyai K., (2015), STEM skills in the workforce: What do employers want? retrieved from, accessed on 06/10/2015.
  74. Richardson M., Abraham C. and Bond R., (2012), Psychological correlates of university students' academic performance: a systematic review and meta-analysis, Psychol. Bull., 138(2), 353–387.
  75. Sarkar M., Overton T., Thompson C. and Rayner G., (2016), Graduate employability: views of recent science graduates and employers, Int. J. Innov. Sci. Math. Educ., 24(3), 31–48.
  76. Stephenson N. S. and Sadler-Mcknight N. P., (2016), Developing critical thinking skills using the science writing heuristic in the chemistry laboratory, Chem. Educ. Res. Pract., 17(1), 72–79.
  77. The Critical Thinking Co, (2017), Cornell critical thinking tests, retrieved from, accessed on 9/10/2017.
  78. Thorndike E. L. and Woodworth R. S., (1901a), The influence of improvement in one mental function upon the efficiency of other functions, (i), Psychol. Rev., 8(3), 247–261.
  79. Thorndike E. L. and Woodworth R. S., (1901b), The influence of improvement in one mental function upon the efficiency of other functions. ii. The estimation of magnitudes, Psychol. Rev., 8(4), 384–395.
  80. Thorndike E. L. and Woodworth R. S., (1901c), The influence of improvement in one mental function upon the efficiency of other functions: functions involving attention, observation and discrimination, Psychol. Rev., 8(6), 553–564.
  81. Tiruneh D. T., Verburgh A. and Elen J., (2014), Effectiveness of critical thinking instruction in higher education: a systematic review of intervention studies, High. Educ. Stud., 4(1), 1–17.
  82. Tiruneh D. T., De Cock M., Weldeslassie A. G., Elen J. and Janssen R., (2016), Measuring critical thinking in physics: development and validation of a critical thinking test in electricity and magnetism, Int. J. Sci. Math. Educ., 1–20.
  83. Tsai C. C., (2001), A review and discussion of epistemological commitments, metacognition, and critical thinking with suggestions on their enhancement in internet-assisted chemistry classrooms, J. Chem. Educ., 78(7), 970–974.
  84. University of Adelaide, (2015), University of Adelaide graduate attributes, retrieved from, accessed on 15/04/2015.
  85. University of Edinburgh, (2017), The University of Edinburgh's graduate attributes, retrieved from, accessed on 09/10/2017.
  86. University of Melbourne, (2015), Handbook – chemistry, retrieved from, accessed on 15/04/2015.
  87. Watson G. and Glaser E. M., (2006), Watson-Glaser critical thinking appraisal short form manual, San Antonio, TX: Pearson.


Electronic supplementary information (ESI) available. See DOI: 10.1039/c8rp00130h

This journal is © The Royal Society of Chemistry 2020