Jared Breakall,* Christopher Randles and Roy Tasker
Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana, USA. E-mail: jbreakal@purdue.edu
First published on 28th January 2019
Multiple-choice (MC) exams are common in undergraduate general chemistry courses in the United States and are known for being difficult to construct. With their extensive use in the general chemistry classroom, it is important to ensure that these exams are valid measures of what chemistry students know and can do. One threat to MC exam validity is the presence of flaws, known as item writing flaws, that can falsely inflate or deflate a student's performance on an exam, independent of their chemistry knowledge. Such flaws can disadvantage (or falsely advantage) students in their exam performance. Additionally, these flaws can introduce unwanted noise into exam data. With the numerous possible flaws that can be made during MC exam creation, it can be difficult to recognize (and avoid) these flaws when creating MC general chemistry exams. In this study a rubric, known as the Item Writing Flaws Evaluation Instrument (IWFEI), has been created that can be used to identify item writing flaws in MC exams. The instrument was developed based on a review of the item writing literature and was tested for inter-rater reliability using general chemistry exam items. The instrument was found to have a high degree of inter-rater reliability with an overall percent agreement of 91.8% and a Krippendorff Alpha of 0.836. Using the IWFEI in an analysis of 1019 general chemistry MC exam items, it was found that 83% of items contained at least one item writing flaw with the most common flaw being the inclusion of implausible distractors. From the results of this study, an instrument has been developed that can be used in both research and teaching settings. As the IWFEI is used in these settings we envision an improvement in MC exam development practice and quality.
Because improvement of the teaching process is fundamentally connected with assessment, the quality of assessment has been an area of focus in the chemical education community since the 1920s (Bretz, 2013). Research into the assessment of student learning is diverse, as represented by the publications in Table 1. This sizeable, though not exhaustive, list shows that improving assessment practice and quality has been a focus of those interested in bettering chemistry instruction.
Research topic | Ref. |
---|---|
Investigating faculty familiarity with assessment terminology | (Raker et al., 2013a; Raker and Holme, 2014) |
Surveying chemistry faculty about their assessment efforts | (Emenike et al., 2013) |
Developing two-tier multiple-choice instruments | (Chandrasegaran et al., 2007) |
Analyzing ACS exam item performance | (Schroeder et al., 2012; Kendhammer et al., 2013) |
Investigating teachers use of assessment data | (Harshman and Yezierski, 2015) |
Studying content coverage based on administered ACS exams | (Reed et al., 2017) |
Creation of exam data analysis tools | (Brandriet and Holme, 2015) |
Using Rasch modeling to understand student misconceptions and multiple-choice item quality | (Herrmann-Abell and DeBoer, 2011) |
Development of instruments that can be used to measure the cognitive complexity of exam items | (Knaus et al., 2011; Raker et al., 2013b) |
Aligning exam items with the Next Generation Science Standards | (Cooper, 2013; Laverty et al., 2016; Reed et al., 2016) |
Creating exams that discourage guessing | (Campbell, 2015) |
Outlining how to develop assessment plans | (Towns, 2010) |
Self-constructed assessments, particularly multiple-choice (MC) assessments, are popular in undergraduate chemistry courses. In a recent survey, 93.2% of 1282 chemistry instructors from institutions that confer bachelor's degrees in chemistry reported creating their own assessments (Gibbons et al., 2018). Furthermore, in a survey of 2750 biology, chemistry, and physics instructors from public and private US institutions, 56% of chemistry instructors reported using MC exams in some or all of their courses (Goubeaud, 2010). A practical reason for the popularity of MC assessments is their ease and reliability of grading for large groups of students (Thorndike and Thorndike-Christ, 2010).
Although MC exams tend to have high grading reliability, creating valid MC items that perform reliably is difficult and requires skill to do properly (Pellegrino, 2001). Against this background, chemistry instructors typically have little to no formal training in appropriate assessment practices (Lawrie et al., 2018). Beyond this lack of training, creating MC assessments is also difficult because there are numerous ways to lessen an exam's validity through how it is designed. When designing MC exams, it is important to follow appropriate design recommendations to increase the exam's validity. Many of these recommendations are outlined in the literature as item writing guidelines (Haladyna et al., 2010).
Disregarding these guidelines can introduce variance in student performance that stems from something other than the student's knowledge of what is being tested (Downing, 2002). This is known as construct irrelevant variance (CIV) (Thorndike and Thorndike-Christ, 2010). Examples of CIV include students using test-taking strategies, teaching to the test, and cheating (Downing, 2002).
CIV in the form of testwiseness, which is defined as a student's capacity to utilize the characteristics and formats of a test and/or the test taking situation to receive a high score (Millman and Bishop, 1965), has been studied in students answering MC chemistry exam items (Towns and Robinson, 1993). Although it was found that students do use testwise strategies when answering exam items (Towns and Robinson, 1993), these exams can be written so as to reduce CIV in the form of testwiseness (Towns, 2014).
MC items can be evaluated in several ways, which can help a test creator make judgements about how an item functions and what performance on that item means about student learning. Two common quantitative measures of MC items are the difficulty and discrimination indices. Item difficulty, which is typically reported as a percentage, is the number of students who answered an item correctly divided by the total number of students who attempted the item. This gives an idea of how hard or easy an item was for the students tested. Item discrimination, which can be measured in several ways, is a measure of how well an item separates students based on ability. The higher the discrimination value, the better the item differentiates between students of different ability levels. These measures, although they cannot reveal everything about an item's quality, can give an idea of how the item functions for the students tested.
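To make these two indices concrete, the Python sketch below (not part of the original study; the response matrix and function names are illustrative) computes item difficulty as the proportion correct and item discrimination as a point-biserial correlation between item score and total exam score.

```python
import numpy as np

def item_difficulty(item_scores):
    """Proportion of students who answered the item correctly (0/1 scores)."""
    return float(np.mean(item_scores))

def item_discrimination(item_scores, total_scores):
    """Point-biserial discrimination: correlation between scores on this item
    and total exam scores (higher values = better separation by ability)."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

# Hypothetical data: rows are students, columns are items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

totals = responses.sum(axis=1)
for i in range(responses.shape[1]):
    p = item_difficulty(responses[:, i])
    d = item_discrimination(responses[:, i], totals)
    print(f"Item {i + 1}: difficulty = {p:.2f}, discrimination = {d:.2f}")
```

In practice a corrected item-total correlation (excluding the item from the total score) is often preferred, but the uncorrected version above is sufficient to illustrate the idea.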
Guidelines for how to write MC items have been published (Frey et al., 2005; Moreno and Martı, 2006; Haladyna et al., 2010; Towns, 2014). Following or disregarding these guidelines can affect how students perform on exams (Downing, 2005).
Neglecting to adhere to item writing guidelines can affect exam performance in ways that are significant to both instructors and students. For example, in a 20 item MC general chemistry exam, each item is worth five percent of a student's exam score. Therefore, if even a few of the items are written in a way where a student either answers correctly or incorrectly based on features of the item instead of their understanding of the chemistry, this can unjustifiably inflate or deflate the student's exam grade by 10–20%. This has been demonstrated in a study where flawed items may have led to the incorrect classification of medical students as failed when they should have been classified as passing (Downing, 2005).
Many item writing guidelines exist, and they apply to five main aspects of MC exam construction: preparing the exam, the overall item, the stem, the answer set, and the exam as a whole. These guidelines will now be outlined.
Additionally, MC items need to be written succinctly (Haladyna et al., 2010). If an item is not succinctly written, unwanted CIV can be introduced. In a study of overly wordy accounting items, average student performance was worse than on more succinct versions of items testing the same content (Bergner et al., 2016). This may be because overly wordy items can decrease reliability, make items more difficult, and increase the time it takes to complete an exam (Rimland, 1960; Board and Whitney, 1972; Schrock and Mueller, 1982; Cassels and Johnstone, 1984). An overly wordy item may also end up testing reading ability or other constructs rather than the chemistry concepts of interest. Fig. 1 shows examples of wordy and succinct items.
Fig. 1 Example from (Towns, 2014): wordy item (left); succinct item (right). https://pubs.acs.org/doi/abs/10.1021/ed500076x; further permissions related to this figure should be directed to the ACS.
Items should be free from grammatical and phrasing cues (Haladyna and Steven, 1989). These are inconsistencies in the grammar or phrasing of an item that can influence how a student interprets, understands, and answers the item (Haladyna et al., 2010). They can include associations between the stem and the answer choices that may cue a student to the correct answer. Grammatical inconsistencies can provide cues to students that affect the psychometric properties of the exam item (Dunn and Goldstein, 1959; Plake, 1984; Weiten, 1984). See Fig. 2 for examples of grammatically inconsistent and grammatically consistent items. In the grammatically inconsistent item in Fig. 2, answer choice (a) “one” is not consistent with the stem because the item would read, “Carbon has one protons”. This can cue students that it is unlikely to be the correct answer choice. Therefore, MC items should be written without grammatical or phrasing cues.
Fig. 2 Grammatically inconsistent item (left); grammatically consistent item (right) (examples from Appendix 1).
Items that contain multiple combinations of answer choices (known as K-type items) should be avoided when constructing MC exams (Albanese, 1993; Downing et al., 2010). An example of this format is shown in Fig. 3. K-type items have been shown to contain cueing that decreases reliability and inflates test scores when compared to multiple true-false (select all that are true) items (Albanese et al., 1979; Harasym et al., 1980). This is likely because examinees can easily eliminate answer choices by determining the validity of only a few responses (Albanese, 1993). Therefore, K-type items should be avoided, as they can contain cues that decrease reliability and inflate test scores.
Items should be kept to an appropriate level of cognitive demand. Although there are several ways to determine cognitive demand, including thorough and effective instruments such as the cognitive complexity rating tool (Knaus et al., 2011), one simple guideline is that an item should not require more than six ‘thinking steps’ to answer. A thinking step is defined by Johnstone as a thought or process that must be activated to answer the question (Johnstone and El-Banna, 1986). If more than six thinking steps are involved in an item, student performance can decrease sharply due to working memory overload (Johnstone and El-Banna, 1986; Johnstone, 1991, 2006; Tsaparlis and Angelopoulos, 2000). This working memory overload may interfere with the student's ability to demonstrate their understanding of the concept(s) being tested (Johnstone and El-Banna, 1986). Furthermore, in studies of M-demand (defined as the maximum number of steps that a subject must activate simultaneously in the course of executing a task), it has been shown that student performance decreases as the M-demand of an item increases (Niaz, 1987, 1989; Hartman and Lin, 2011). Although the number of thinking steps in an item can be difficult to determine because of chunking and varying ability levels, keeping the number of thinking steps in an item to six or fewer can help ensure that an item is assessing understanding of chemistry and not cognitive capabilities.
Fig. 4 Unfocused item (left); focused item (right). Reproduced from (Dudycha and Carpenter, 1973) with permission.
Positively worded items are recommended over negatively worded items (Haladyna and Rodriguez, 2013). This is because negative phrasing in an item can increase reading and reasoning difficulty and introduce cueing, and because testing a student's understanding of an exception is not always the same as testing their understanding of the actual learning objective (Cassels and Johnstone, 1984; Harasym et al., 1992; Thorndike and Thorndike-Christ, 2010). If an item must be worded negatively to test a desired skill or construct, the negative phrase should be highlighted (Haladyna et al., 2010). Negatively phrased items were shown to be more difficult when the negative phrase was not emphasized (Casler, 1983), perhaps because highlighting the negative phrase minimizes the risk that a student will miss it while reading the item. Therefore, if negative phrasing must be used, it is recommended that it be emphasized.
The second answer choice set creation guideline is that ‘all of the above’ should be avoided (Haladyna et al., 2002; Downing et al., 2010). This is because its use has been shown to significantly enhance student performance on MC items due to an inherent cueing effect (Harasym et al., 1998). This improvement in student performance is likely due to students being able to more easily eliminate or select ‘all of the above’ when compared to items in a ‘select all that are true’ format (Harasym et al., 1998).
Third, answer choices should be arranged in a logical order, for example in ascending or descending numerical order (Haladyna and Steven, 1989; Moreno and Martı, 2006). In a study that found higher discrimination values on items with randomly ordered answer choices versus logically ordered ones, it was concluded that although answer choice order is not likely an influencing factor for higher-ability students, it may affect the exam performance of lower-ability students (Huntley and Welch, 1993). Additionally, arranging answer choices in logical or numerical order was found to be unanimously supported in a review of measurement textbooks (Haladyna et al., 2010).
The fourth recommendation is that the answer choices should be kept to approximately the same length and level of detail (Frey et al., 2005; Haladyna et al., 2010; Towns, 2014). This is because students can use the ‘choose the longest answer’ heuristic when taking an examination: students may believe that if an answer choice is significantly longer or provides more detail than the other answer choices, it is more likely to be correct. Research has shown that items with inconsistently long answer choices are easier and less discriminating than items where the answer choices are approximately equal in length (Dunn and Goldstein, 1959; Weiten, 1984).
Additionally, placing three or more difficult items next to each other is also poor practice in MC exam creation. This has been found to decrease performance on general chemistry items in ACS exams (Schroeder et al., 2012). Possible causes of this effect may be self-efficacy or exam fatigue (Galyon et al., 2012; Schroeder et al., 2012).
An exam should have an approximately even distribution of correct answer choices, a practice known as key balancing (Towns, 2014). This is because, on unbalanced exams, test makers and test takers have been shown to have a tendency to choose the middle options of an MC item as the correct response (Attali, 2003). This produces a bias that affects the psychometric properties of an exam by making items with middle-keyed answer choices easier and less discriminating (Attali, 2003). Additionally, test takers tend to expect different answers on subsequent items when they see a “run” of answer choices, even though this expectation is not necessarily true (Lee, 2018). Thus, it is good practice to ensure that there is an approximately even distribution of correct answer choices.
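As a rough illustration of checking key balance, the short Python sketch below (the answer key and the 10% tolerance threshold are hypothetical choices, not taken from the study) tallies how often each option letter is keyed correct and flags options that deviate noticeably from an even split.

```python
from collections import Counter

def key_balance_report(answer_key, tolerance=0.10):
    """Tally how often each option is the keyed (correct) answer and flag any
    option whose share deviates from an even split by more than `tolerance`
    (the 10% threshold here is an arbitrary choice for this sketch)."""
    counts = Counter(answer_key)
    n_items = len(answer_key)
    expected_share = 1 / len(counts)  # even split across the options that appear
    for option in sorted(counts):
        share = counts[option] / n_items
        flag = "  <-- check" if abs(share - expected_share) > tolerance else ""
        print(f"Option {option}: {counts[option]:2d} items ({share:.0%}){flag}")

# Hypothetical 20-item answer key with options a-d; option 'c' is over-represented
key_balance_report(list("ccbadccacbccdaccbccd"))
```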
An exam should not link performance on one item with performance on another (Haladyna et al., 2002; Moreno and Martı, 2006; Hogan and Murphy, 2007). All items should be independent from each other so that a student has a fair chance to answer each item correctly.
With this said, linked “two-tier” multiple-choice items have been used to assess student reasoning by asking students to explain, or choose an explanation for, their answer choice in a previous item (Tan et al., 2002; Chandrasegaran et al., 2007). While this practice has merit when used with purpose, simply linking items that test chemistry content can disadvantage students who incorrectly answer the first item.
Discipline | Ref. | Sample size | Frequency of flawed items | Most common flaws
---|---|---|---|---
Nursing | (Tarrant et al., 2006) | 2770 MC items | 46.2% | Unclear stem 7.5%; negative stem 6.9%; implausible distractors 6.6%
Nursing | (Tarrant and Ware, 2008) | 664 MC items | 47.3% | Unfocused stem 17.5%; negative stem 13%; overly wordy 12.2%
Pharmacy | (Pate and Caldwell, 2014) | 187 MC items | 51.8% | All of the above 12.8%; overly wordy 12.8%; K-type 10.2%
Medical education | (Stagnaro-Green and Downing, 2006) | 40 MC items | 100% | Phrasing cues 100%; unfocused stem 100%; overly wordy 73%
• Represent common item writing guidelines
• Be able to identify item writing guideline violations in MC items
• Be able to be used reliably between raters
• Be able to be used with a minimal amount of training
Additionally, although adherence to item writing guidelines has been a topic of research in other academic disciplines (Tarrant et al., 2006; Pate and Caldwell, 2014), few research studies have addressed the extent to which general chemistry exams adhere to MC item writing guidelines. Therefore, once developed, the instrument was used to evaluate the adherence of 43 general chemistry exams (1019 items) to item writing guidelines.
Rater | Position | Discipline | Years of experience teaching general chemistry | Number of general chemistry MC exams created |
---|---|---|---|---|
1 (1st author) | Graduate Student | Chemical Education | 0–2 | 1–4 |
2 | Assistant Professor | Inorganic | 2–4 | 1–4 |
3 | Instructor | Chemical Education | 10+ | 20+ |
Raters 2 and 3 (Table 3) received a 20-minute orientation (from rater 1) on how to use the IWFEI, and then raters 1–3 used the instrument to individually rate 10 general chemistry MC exam items. The 10 items were chosen because they contained a variety of item writing guideline violations within the scope of what the IWFEI is intended to detect. These items are found in Appendix 2.
The chemistry exams analyzed were from a 1st semester general chemistry course for scientists and engineers at a public, R1 university in the midwestern United States. The exams were created by committees of instructors who taught different sections of the same course. A total of 33 unit exams and 10 final exams were analyzed. A unit exam is given during a semester and covers a subset of the course content, typically approximately a fourth of the course's material. A final exam is typically cumulative and given at the end of the semester. The exams included topics common in first semester general chemistry such as stoichiometry, gas laws, light and energy, radioactivity, periodic trends, thermochemistry, Lewis structures, polarity of molecules, and intermolecular forces. The items contained various levels of representation, including chemical formulas, diagrams, molecular level representations, and graphs. If a representation in an item (such as a chemical formula) affected how to interpret a criterion in the IWFEI, it was noted in Appendix 1 for the use of the instrument. The exams were administered between 2011 and 2016 and were composed of MC items exclusively. Unit and final exams contained 20 items and approximately 40 items, respectively. A total of 1019 items were evaluated.
The two raters first evaluated 400 items individually. Then they discussed discrepancies in ratings among those 400 items. These discussions led to the modification of criteria wording and definitions in the IWFEI. Once modifications were made, the initial 400 items were re-evaluated using the updated instrument. Lastly, the remaining items in the data set were evaluated using the IWFEI.
Percent agreement and Krippendorff alpha statistics were then calculated between the two raters using the irr package in R. The raters then reached consensus on any disagreements. Once consensus was reached, the percentage of items adhering to each item writing guideline in the instrument was calculated.
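For readers who want to reproduce this style of reliability calculation outside of R, a minimal Python sketch is shown below. It assumes the third-party krippendorff package is installed and uses made-up ratings; the study itself performed the calculation with the irr package in R.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

def percent_agreement(ratings_a, ratings_b):
    """Fraction of items on which the two raters gave identical ratings."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Hypothetical yes/no ratings from two raters on ten items (1 = adheres, 0 = violates)
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

agreement = percent_agreement(rater_1, rater_2)
alpha = krippendorff.alpha(reliability_data=np.array([rater_1, rater_2]),
                           level_of_measurement="nominal")
print(f"Percent agreement: {agreement:.1%}")
print(f"Krippendorff's alpha: {alpha:.3f}")
```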
This study was approved by the institutional review board at the institution where the study took place.
Criterion 10, ‘Are all answer choices plausible?’, was given two definitions. Definition 10a states that all distractors must be based on student errors or misconceptions. This definition may have utility as a reflective tool when the IWFEI is used to analyze one's own exam before it is administered, yet it was shown to have poor reliability when used to analyze exam items written by other instructors. Definition 10b is based on item statistics and defines an implausible distractor as one that fewer than 5% of students selected. Definition 10b should be used when the IWFEI is applied to historical exam items written by other instructors.
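Definition 10b lends itself to a simple calculation from item response counts. The Python sketch below (with hypothetical response counts; the function name and threshold parameter are ours) flags distractors selected by fewer than 5% of students.

```python
def implausible_distractors(choice_counts, correct_option, threshold=0.05):
    """Return the distractors chosen by fewer than `threshold` of respondents,
    following definition 10b (fewer than 5% of students selecting them)."""
    total = sum(choice_counts.values())
    return [option for option, count in choice_counts.items()
            if option != correct_option and count / total < threshold]

# Hypothetical response counts for a five-option item keyed to choice 'c'
counts = {"a": 41, "b": 6, "c": 188, "d": 57, "e": 8}
print(implausible_distractors(counts, correct_option="c"))  # -> ['b', 'e']
```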
When the initial inter-rater reliability was calculated, a 78.2% agreement was found between the three raters with a 0.725 Krippendorff alpha. This suggests that there was a substantial level of agreement between the raters (Landis and Koch, 1977). Furthermore, when looking at the agreement between individual raters in Table 3, raters 1 and 2, 2 and 3, and 1 and 3 had percent agreements of 83.3, 87.5, and 85.0%, respectively. Because all raters in Table 3 (less experienced and more experienced) agreed above an 80% level, we moved forward with the item analysis phase of the study. During that phase, when the final version of the instrument was used by the 1st and 2nd authors to rate the 1019 items (10458 total ratings), it had a 91.8% agreement and a 0.836 Krippendorff alpha. As these reliability statistics were above 80% and 0.80, respectively, the authors decided this was an acceptable level of agreement.
Criterion 9 was used to analyze a subset of 96 items from the fall of 2016, instead of all 1019 items, because of the lack of availability of learning objectives from previous semesters. Criterion 10 (using the 10a definition) had a consistently low level of reliability at 68.9%. Criteria 12–15 were rated 43 times, instead of 1019 times, because they apply only once per exam being analyzed.
The inter-rater reliability statistics for the individual criteria of the IWFEI as used across the 1019 items are found in Table 4.
Criteria | Guidelines | Percent agreement (%) |
---|---|---|
1 | Is the test item clear and succinct? | 93.9 |
2 | If the item uses negative phrasing such as “not” or “except”, is the negative phrase bolded? | 97.0 |
3 | If the answer choices are numerical, are they listed in ascending or descending order? | 95.9 |
4 | If the answer choices are verbal, are they all approximately the same length? | 90.8 |
5 | Does the item avoid “all of the above” as a possible answer choice? | 96.9 |
6 | Does the item avoid grammatical and phrasing cues? | 98.4 |
7 | Could the item be answered without looking at the answer choices? | 85.4 |
8 | Does the item avoid complex K-type item format? | 97.5 |
9 | Is the item linked to one or more objectives of the course? | 84.4 |
10a | Are all distractors plausible? | 68.9 |
11 | Are there six or less thinking steps needed to solve this problem? | 94.5 |
12 | Does the exam avoid placing three or more items that assess the same concept or skill next to each other? | 95.3 |
13 | Does the exam avoid placing three or more difficult items next to each other? | 100.0 |
14 | Is there an approximately even distribution of correct answer choices? | 86.0 |
15 | Does the exam avoid linking performance on one item with performance on others? | 100.0 |
The number and percentages of items that adhered to and violated specific item writing guidelines (1–11) are found in Table 5. Boxes highlighted in bold represent guidelines classified as having a high level of adherence (above 90%), while those highlighted in italics represent guidelines classified as having a low level of adherence (below 75%).
Bold: above 90% adherence. Italics: below 75% adherence.

Criterion | Abbreviated criterion | Applicable items | Adhered (yes) | Violated (no) | N/A | Adhered % | Violated %
---|---|---|---|---|---|---|---
1 | Clear/succinct | 1019 | 960 | 59 | — | 94.2 | 5.8 |
2 | Bolded negative phrase | 77 | 65 | 12 | 942 | 84.4 | 15.6 |
3 | Ascending/descending order of choices | 311 | 226 | 85 | 708 | 72.7 | 27.3 |
4 | Approx. even answer choice length | 298 | 249 | 49 | 721 | 83.6 | 16.4 |
5 | Avoid ‘all of the above’ | 1019 | 973 | 46 | — | 95.5 | 4.5 |
6 | Avoid grammar/phrasing cues | 1019 | 1013 | 6 | — | 99.4 | 0.6 |
7 | Answer without answer choices | 1019 | 720 | 299 | — | 70.7 | 29.3 |
8 | Avoid K-type | 1019 | 963 | 55 | — | 94.5 | 5.4 |
9 | Linked to objective | 96 | 87 | 9 | — | 90.6 | 9.4 |
10a | Plausible distractors | 1019 | 447 | 542 | — | 46.8 | 53.2 |
10b | Plausible distractors | 1019 | 206 | 813 | — | 20.2 | 79.8 |
11 | Six or less thinking steps | 1019 | 972 | 44 | — | 95.4 | 4.3 |
Criteria 12–15, found in Table 6, apply to exams as a whole and not to individual items. Boxes highlighted in bold represent guidelines classified as having a high level of adherence (above 90%), while those highlighted in italics represent guidelines classified as having a low level of adherence (below 75%).
Bold: above 90% adherence. Italics: below 75% adherence.

Criterion | Abbreviated criterion | Applicable exams | Adhered (yes) | Violated (no) | N/A | Adhered % | Violated %
---|---|---|---|---|---|---|---
12 | Avoid 3+ items testing the same concept next to each other | 43 | 35 | 8 | 0 | 81.4 | 18.6 |
13 | Avoid 3+ difficult items next to each other | 43 | 42 | 1 | 0 | 97.7 | 2.3 |
14 | Approx. even answer key distribution | 43 | 17 | 26 | 0 | 39.5 | 60.5 |
15 | Avoid linking items based on performance | 43 | 43 | 0 | 0 | 100 | 0 |
Criteria two, three, and four applied to only a small portion of the items analyzed (see Table 5). Criterion two applied to 77 items, with 84.4% of those items adhering to the guideline and 15.6% in violation. Criterion three applied to 311 items, with 72.7% of those items adhering to the guideline and 27.3% in violation. Criterion four applied to 298 items, with 83.6% of those items adhering to the guideline and 16.4% in violation. The remaining criteria applied to all available items or exams.
Additionally, the inter-rater reliability procedure of rating items individually, calculating agreement, and then coming to consensus on differences has been used successfully in other studies (Srinivasan et al., 2018).
Once developed, the instrument was tested for reliability and was shown to have a high level of reliability (91.8% agreement and 0.836 Krippendorff alpha), which is similar to other exam evaluation instruments such as the Cognitive Complexity Rating Instrument (Knaus et al., 2011) and the Three Dimensional Learning Assessment Protocol (3D-LAP) (Laverty et al., 2016). It is important to note that the Cognitive Complexity Rating Instrument uses an interval rating scale versus the categorical scale used by the IWFEI. Although this makes it difficult to directly compare reliability values, they are at a similarly high level.
The criterion that presented the greatest difficulty in reliable use was criterion 10, ‘Are all answer choices plausible?’. Initially, we began our analysis of the historical exam data using definition 10a, which defines a plausible distractor as one based on student errors or misconceptions. This proved to be an unreliable way to evaluate historical, non-self-constructed exam items, with a percent agreement between raters of 68.9%. Although definition 10a may have utility when evaluating one's own exam items before administration, it was not appropriate for evaluating items created by other instructors. Conversely, definition 10b, which defines implausible distractors as ones that fewer than five percent of students select, is appropriate for evaluating non-self-constructed exams but cannot be applied to exam items before administration or without exam statistics available. We see both definitions as having merit when used in appropriate situations, and we foresee that having both definitions will increase the utility of the IWFEI and make it more useful to researchers and instructors alike.
Fig. 8 Overly wordy item identified using the IWFEI (left); overly wordy item example from (Towns, 2014) (right). https://pubs.acs.org/doi/abs/10.1021/ed500076x; further permissions related to the right side of this figure should be directed to the ACS.
In another example, the item on the left-hand side of Fig. 9 was identified as a K-type item by using the IWFEI. This item format has answer choices that can be easily eliminated based on analytic reasoning and thus may not provide the most valid data on what students know. The item on the right in Fig. 9 is an example of a K-type item as found in the literature.
Fig. 9 K-type item identified using the IWFEI (left); K-type item example from the literature (right). Reproduced from (Albanese, 1993) with permission.
In a third example, the items in Fig. 10 were identified as containing implausible distractors by using the IWFEI. In the literature, an implausible distractor has been operationally defined as a distractor that fewer than 5% of students select. The percentage of students who selected each answer choice is indicated in the figure. From this, we see that the item on the left-hand side contains two implausible distractors, (b) and (e), while the item on the right-hand side contains one implausible distractor, (e). In the item on the right, answer choice (e), ‘quintinary’, is not a level of protein structure. This may indicate that the instructor included this answer choice solely for formatting reasons, or out of a belief that more answer choices are better in MC items regardless of their quality.
Fig. 10 Items with implausible distractors identified using the IWFEI; the percentage of students choosing each answer choice is indicated.
The three cases discussed above show examples of item writing flaws identified with the IWFEI and how they are comparable to what is described in the literature. This provides evidence that the IWFEI can be used in a valid way to identify items that contain flaws in their construction.
Once an item has been identified as containing violations of accepted guidelines, the user can then decide if and how they will modify the item. It is important to note that the IWFEI is not intended to identify bad items, but to identify items that may need to be modified at the instructor's or researcher's discretion.
To demonstrate this, a flawed item and its revision are shown in Fig. 11. The original item in Fig. 11 was identified as containing two flaws, an incomplete stem and implausible distractors (violations of criteria 7 and 10). In the revised version of the item, the stem was rewritten to contain a complete problem statement and the implausible distractors were removed.
The most common flaw in the exams analyzed was the inclusion of implausible distractors, found in 79.8% of items. Although initially surprising, this percentage is similar to the results of other studies in which 90.2% and 100% of items, respectively, contained implausible distractors (Haladyna and Downing, 1993; Tarrant and Ware, 2010).
The most common flaws found in the chemistry exams analyzed differed from the most common flaws found in nursing, pharmacy, and medical examinations (Stagnaro-Green and Downing, 2006; Tarrant et al., 2006; Tarrant and Ware, 2008; Pate and Caldwell, 2014). In this study the most common flaws were: including implausible distractors (79.8%), uneven answer choice distribution (60.5%), and including incomplete stems (29.3%). When compared with the most common flaws in other studies (Table 2), the only overlap was including incomplete stems.
Additionally, the IWFEI has been used and validated with a sample of 43 1st semester general chemistry exams, so it is unclear how it will perform with exams from other disciplines or content areas. With that said, we do foresee the IWFEI being usable for a wider variety of courses than introductory chemistry; it simply has not been used in those contexts yet. Because the criteria and guidelines in the IWFEI have been applied in other disciplines, we do not envision further validation being an issue. We invite those from other disciplines to use the IWFEI.
We recognize that there are many other facets to consider when creating MC assessments that are not included in this instrument. The IWFEI was not intended to address all aspects of MC assessment design, but to help identify common item writing guideline violations found in the literature. Using this instrument alone does not guarantee that an item or exam will perform as desired, although we do foresee its use improving the quality of MC assessments with regard to item writing guideline adherence.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8rp00262b