Psychometric analysis of the resonance concept inventory†
Received 10th June 2024, Accepted 29th October 2024
First published on 29th October 2024
Abstract
Many undergraduate chemistry students hold alternate conceptions related to resonance, an important and fundamental topic of organic chemistry. To help address these alternate conceptions, an organic chemistry instructor could administer the resonance concept inventory (RCI), a multiple-choice assessment designed to identify resonance-related alternate conceptions held by organic chemistry students. In this study, two iterations of the RCI were administered to undergraduate organic chemistry students: the RCI-Pilot (N = 484) and the RCI-Final (N = 595). Evidence was collected to support the quality of the RCI items, the validity of the data obtained with the RCI based on internal structure, and the reliability of the data obtained with the RCI. Classical test theory (CTT) was utilized to determine the quality of the items. To gather validity evidence, the Rasch model was used and a differential item functioning (DIF) analysis was conducted. Reliability estimates were made using McDonald's omega. Since validity and reliability evidence was gathered for the assessment scores, the data obtained in this study support the use of the 14-item RCI for detecting student alternate conceptions about resonance.
    
      Introduction
      
        Resonance concept inventory
        Resonance is considered by experts in the field to be a fundamental concept in organic chemistry (Duis, 2011). Specifically, in one study, experts reported resonance as one of the most important concepts for students to comprehend to develop a proficiency in reaction mechanisms in organic chemistry (Nedungadi and Brown, 2021). Since the concept is crucial, it is important that organic chemistry students gain a strong understanding of resonance early on in their undergraduate careers.
        However, students seem to hold alternate conceptions—beliefs that are inconsistent with accepted scientific ideas—pertaining to resonance which lead to difficulties as they progress through the organic chemistry sequence. Students have been found to hold the following resonance-related alternate conceptions: resonance and equilibrium are the same, resonance states change over time, resonance structures must contain charges, and atoms can rearrange between resonance structures (Duis, 2011; Widarti et al., 2017; Tetschner and Nedungadi, 2023). Alternate conceptions about the octet rule and about the drawing of Lewis structures have also been observed, and students who hold these alternate conceptions tend to struggle with drawing and analyzing resonance structures (Betancourt-Perez et al., 2010). Multiple studies have found students to possess a limited understanding of the relationship between resonance structures and the resonance hybrid (Betancourt-Perez et al., 2010; Duis, 2011; Kim et al., 2019; Xue and Stains, 2020). Additionally, it has been reported that many students tend to focus on the operational definition of resonance and lack the conceptual understanding that would allow them to connect ideas of resonance to the reactivity and stability of organic compounds (Xue and Stains, 2020; Brandfonbrener et al., 2021; Tetschner and Nedungadi, 2023).
        Since resonance is a fundamental concept to organic chemistry and can be difficult for students to master, the large-scale assessment of student understanding of resonance would be beneficial to organic chemistry instructors. Previous studies that have explored student difficulty and resonance-related alternate conceptions were qualitative studies that included smaller samples from a limited number of universities (Betancourt-Perez et al., 2010; Kim et al., 2019; Xue and Stains, 2020; Brandfonbrener et al., 2021). The development of an assessment related to resonance for large-scale administration could help obtain more generalizable data and give further information regarding student alternate conceptions.
        One such assessment is a concept inventory. A concept inventory is a multiple-choice assessment that is developed using commonly held alternate conceptions to create distractors, incorrect answer choices, for each item on the assessment (Lindell et al., 2007). Concept inventories have been developed in different sub-disciplines of chemistry (Mulford and Robinson, 2002; Villafañe et al., 2011; Bretz and Linenberger, 2012; Wren and Barbera, 2013; Brandriet and Bretz, 2014; Brown et al., 2015; Dick-Perez et al., 2016; Bretz and Murato Mayo, 2018; Abell and Bretz, 2019; Atkinson et al., 2020). However, only a few concept inventories have been developed in organic chemistry (McClary and Bretz, 2012; Leontyev, 2015; Nedungadi et al., 2021). One such concept inventory is the Fundamental Concepts for Organic Reaction Mechanisms Inventory which was designed to detect student alternate conceptions on fundamental general chemistry and organic chemistry concepts that are important for organic reaction mechanisms (Nedungadi et al., 2021). This instrument has items related to resonance, but since it is not a single topic concept inventory, it does not provide depth in student understanding of resonance. A concept inventory with all items related to resonance would help provide this depth in student understanding of resonance and would add to the existing literature. Information gathered with this concept inventory could assist organic chemistry instructors in making instructional modifications to address alternate conceptions held by their students. Concept inventories are often also utilized to measure learning gains of students, so a concept inventory related to resonance could be used to track student understanding of resonance as they progress through an organic chemistry sequence (Sands et al., 2018).
        The initial development of items for a resonance concept inventory (RCI) has been reported (Tetschner and Nedungadi, 2023). Open-ended items were developed and administered to first-semester undergraduate organic chemistry students to obtain the most commonly occurring alternate conceptions. From these alternate conceptions, distractors were derived to convert the open-ended items to multiple-choice items. The common resonance-related alternate conceptions that were identified were reported in that earlier study, along with evidence for validity based on test content and response processes. Because the development of assessment tools like concept inventories is an iterative process, the RCI went through multiple iterations before the RCI-Pilot was developed (Tetschner and Nedungadi, 2023).
      
      
        Item quality
        Classical test theory (CTT) has been routinely used as the first step in determining the quality of items when developing concept inventories (Mulford and Robinson, 2002; Villafañe et al., 2011; Bretz and Linenberger, 2012; Wren and Barbera, 2013; Brandriet and Bretz, 2014; Brown et al., 2015; Dick-Perez et al., 2016; Bretz and Murato Mayo, 2018; Abell and Bretz, 2019; Atkinson et al., 2020). From a CTT perspective, the raw score, which is the number of items answered correctly, is assumed to consist of a respondent's true ability plus or minus some degree of measurement error (Crocker and Algina, 1986). This implies that correctly answering 60 out of 100 items, for example, means the same thing for everyone: if two respondents each had a raw score of 60 out of 100 items, we would conclude that they have the same level of knowledge of that domain, irrespective of the fact that the 40 items they missed could have been completely different (Stemler and Naples, 2021). CTT statistics are also sample dependent. However, the calculation of parameters in CTT involves simpler mathematical procedures, which makes them easier to follow. Users of assessment tools tend to encounter these parameters more often, so they are useful for practitioners who want to obtain item quality information quickly.
      
      
        Validity
        When developing and evaluating assessments, it is crucial that the validity and reliability of the data obtained with the assessment are established (Bandalos, 2018). Validity describes the extent to which an instrument measures what it intends to measure, and reliability describes the consistency of measurements across replications or across administrations of an assessment (American Educational Research Association et al., 2014). During the development of an assessment, validity evidence should be gathered that supports the intended interpretations of the assessment, and the reliability of the data obtained using the assessment should be evaluated.
        According to the Standards, there are five types of validity evidence: (1) evidence based on test content, (2) evidence based on response processes, (3) evidence based on internal structure, (4) evidence based on relations to other variables, and (5) evidence for consequences of testing (American Educational Research Association et al., 2014). As the five types of validity evidence are different aspects of the validity argument and not distinct types of validity, it is advantageous to acquire multiple types of validity evidence during the development of an assessment, although it is not always necessary for all five types to be gathered (Kane, 2013). During the development of the RCI-Pilot, evidence for validity was gathered based on test content and based on response processes (Tetschner and Nedungadi, 2023). In the current study, evidence for validity based on internal structure was obtained.
        Validity evidence based on internal structure indicates the extent to which the relationships between items or components of an assessment align with the intended construct of the assessment (American Educational Research Association et al., 2014). Some assessments are designed to measure a single dimension and are said to be unidimensional, whereas other assessments are designed to measure multiple dimensions. Assessments are typically based on theory about the dimensionality of the construct being measured, and the interpretations we make regarding scores from the assessment are based on this assumed dimensionality (Bandalos, 2018). The RCI is assumed to be measuring a unidimensional construct: student understanding of resonance. It is important to determine the degree to which the RCI items live up to these dimensional expectations since this is crucial in determining the degree to which we are justified in interpreting the RCI scores (Bandalos, 2018). Since the RCI is designed to detect student alternate conceptions, the interpretations made from the RCI scores would directly impact instructional modifications and interventions. Therefore, it is important that evidence for validity based on internal structure is gathered with the RCI.
        There are different types of evidence for validity based on internal structure that can be obtained, including item correlation evidence and factor analysis evidence (which includes exploratory and confirmatory factor analysis). Another type of validity evidence based on internal structure utilizes item response theory (IRT) methods. These methods are similar to confirmatory factor analysis (CFA) methods, with the difference being that CFA methods assume that items are measured on a continuous scale, whereas IRT methods are generally applied to dichotomously scored items (Bandalos, 2018). IRT models are used to estimate the probability that a respondent with a given ability level will answer an item correctly.
        One model that has similarities to CTT and IRT is the Rasch model. Rasch analysis has been utilized in the development of a few concept inventories in chemistry (Wren and Barbera, 2014; Nedungadi et al., 2021). The Rasch model transforms the raw score into a logit measure called theta which represents the person ability estimate (Rasch, 1980). This logit transformation stretches out the tails of the distribution to approximate a normal curve, thereby reflecting the assumption of normality that the interval between scores at different points is not equal (Stemler and Naples, 2021). This is where the Rasch model differs from CTT. The Rasch model puts person ability and item difficulty on the same scale so that one can see which items were answered correctly by people at a specific ability level. The power of the Rasch model is its ability to help one build a measurement scale and then check to see if the data fit the model, which is usually done by analyzing fit statistics (Stemler and Naples, 2021). This helps one determine if a linear scale has been built that works the same way for all test takers and if the test scores have consistent meaning for all test takers (Stemler and Naples, 2021). Thus, utilizing the Rasch model would help obtain validity evidence based on internal structure for the RCI.
        Evidence for validity based on internal structure can also be obtained by conducting differential item functioning (DIF) analysis (Cook and Beckman, 2006). DIF refers to differences in item performance between groups of the same ability level. DIF explores bias in testing, and differences in item scores can be seen as construct-irrelevant variance that is not explained by the item construct (Crocker and Algina, 1986; Clauser and Mazor, 1998; Osterlind and Everson, 2009). DIF analysis has been used very sparingly in the development of concept inventories in chemistry (Nedungadi et al., 2022) even though the importance of utilizing DIF analysis when evaluating concept inventories has been reported (Martinková et al., 2017). Differences in gender subgroups have been studied extensively in different areas of educational research (Maccoby and Jacklin, 1974; Cole, 1997; Halpern, 1997). DIF analysis by gender not only provides evidence for validity based on internal structure, but it also assists in developing equitable assessments that do not discriminate against individuals (Libarkin, 2008). There have been recent strides to increase the number of women in the physical sciences, so it is important to design assessments in chemistry that do not favor one gender over another (NSF Program for Gender Diversity in STEM Education, 2003).
      
      
        Reliability
        It is also important to establish the reliability of the data obtained using an assessment. In the development of chemistry concept inventories, Cronbach's alpha has been used to report internal consistency reliability, which refers to how closely the items measure the same construct (Crocker and Algina, 1986). However, arguments have been made against the use of Cronbach's alpha for assessment instruments, especially concept inventories, owing to the fragmented knowledge being measured (Adams and Wieman, 2011). McDonald's omega is conceptually similar to alpha, but it allows the common construct to influence the true value of each item to a different degree (Bandalos, 2018). Omega has been recommended as a more appropriate single-administration reliability coefficient than alpha (Komperda et al., 2018). According to the Standards (American Educational Research Association et al., 2014), “there is no single, preferred approach to quantification of reliability/precision. No single index accurately conveys all the relevant information. No one method of investigation is optimal in all situations” (p. 41). Just as multiple types of validity evidence are reported during the development of an assessment, reporting multiple types of reliability estimates is also encouraged (Barbera et al., 2021).
      
      
        Aim of study
        In this study, the RCI-Pilot was administered to first-semester undergraduate organic chemistry students. Based on the psychometric analysis, the items on the RCI-Pilot were modified to give the RCI-Final which was also administered to first-semester undergraduate organic chemistry students. Three research questions guided this study:
        1. What evidence supports the quality of items on the RCI-Pilot and the RCI-Final?
        2. What evidence supports the validity of data obtained using the RCI-Pilot and the RCI-Final based on internal structure?
        3. What evidence supports the reliability of data obtained using the RCI-Pilot and the RCI-Final?
      
    
    
      Methods
      
        Instrument and data collection
        The design and development of the RCI items were reported in a previous study (Tetschner and Nedungadi, 2023). Open-ended items covering common topics related to resonance that are taught in undergraduate organic chemistry classes were developed and administered to first-semester undergraduate organic chemistry students. Commonly occurring alternate conceptions were used as distractors to develop multiple-choice items. Evidence for validity based on test content and response processes was obtained through surveys with organic chemistry faculty and open-ended interviews with students, respectively. This process resulted in two different iterations of the RCI before the 14-item RCI-Pilot was developed.
        For this study, the RCI-Pilot was administered to first-semester undergraduate organic chemistry students (N = 484) from three different mid-sized universities in the USA. This administration was conducted approximately one month into the first-semester organic chemistry class after students were taught resonance and formally assessed on the concept in the form of a quiz or an exam. The students were administered the RCI-Pilot during their regular class period. Students were asked to answer the RCI-Pilot items on a bubble sheet, and they took approximately 20 minutes to complete it.
        The psychometric analysis of the data collected using the RCI-Pilot resulted in the modification of five items to give the 14-item RCI-Final. The RCI-Final was administered to first-semester undergraduate organic chemistry students (N = 595) from four mid-sized universities in the USA. The RCI-Final was also administered approximately one month into the first-semester organic chemistry class. The administration took place during the regular class period, and students took approximately 20 minutes to answer the items on a bubble sheet. Both the RCI-Pilot and the RCI-Final also contained three demographic questions related to gender, major of study, and year of study. Table 1 gives the demographic information for both sets of data collected.
        
Table 1 Demographic information for the RCI-Pilot and RCI-Final administrations
		
            
              
              
              
              
                
|  | RCI-Pilot administration (%) | RCI-Final administration (%) |
| Gender |  |  |
| Male | 40 | 33 |
| Female | 59 | 65 |
| Other | 1 | 2 |
| Major of study |  |  |
| Chemistry/biochemistry | 19 | 17 |
| Biology | 39 | 41 |
| Other | 42 | 42 |
| Year of study |  |  |
| Freshman | 0 | 3 |
| Sophomore | 63 | 50 |
| Junior | 29 | 32 |
| Senior | 7 | 11 |
| Other | 1 | 4 |
              
            
      
      
        Ethical considerations
        Institutional review board (IRB) approval was obtained from the university where the research was primarily conducted before all data collection. No identifying information, such as names of the participants, was collected to maintain complete anonymity. If participants accidentally recorded their names on the provided bubble sheets, then their names were replaced by codes to maintain confidentiality.
      
      
        Data analysis
        The RCI-Pilot data and RCI-Final data were scored dichotomously where all correct answers were given a score of 1 and all incorrect answers were given a score of 0. However, if a student did not respond to more than two items on the assessment, then the student's data was removed entirely from the analysis. This data was excluded to avoid falsely scoring certain item responses, assuming students who did not respond to three or more items on the assessment were not fully participating. The same analysis was conducted for both the RCI-Pilot and RCI-Final data sets. The data analysis conducted to answer each research question is described below.
        
          RQ 1. What evidence supports the quality of items on the RCI-Pilot and the RCI-Final?
          This research question was answered by determining two parameters, item difficulty and item discrimination, within classical test theory (CTT). Item difficulty (P) is the proportion of respondents who correctly answered an item (Bandalos, 2018). For example, if 30 out of 100 respondents got the item correct, then P = 30/100 or 0.3. Item difficulties can range from 0 to 1.0 with values closer to 0 indicating difficult items and values closer to 1 indicating easy items. Difficulties in the range of about 0.3 to 0.8 are acceptable for most assessments (Kaplan and Saccuzzo, 1997).
          Item discrimination indices (D) provide information on the degree to which an item can be used to make distinctions between respondents with a high level of skill or knowledge from respondents with a low level of skill or knowledge (Bandalos, 2018). When determining D, two groups are identified: a high performing group (upper group) and low performing group (lower group). D is calculated by taking the proportion of those in the upper group who respond correctly minus the proportion in the lower group who respond correctly (Bandalos, 2018). Since both the RCI-Pilot and the RCI-Final were administered to a large sample of students, the top 27% of respondents were used as the upper group and the bottom 27% of respondents were used as the lower group (Kelley, 1939). Item discrimination indices above 0.3 are generally considered to be acceptable (Doran, 1980).
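As a minimal sketch of the two CTT parameters described above, the following Python snippet computes P and D from a dichotomously scored response matrix. It is an illustration only: the authors report using Microsoft Excel, the function names and the toy data are hypothetical, and the way tied total scores are assigned to the upper and lower groups (stable sort order) is an assumption.

```python
import numpy as np

def item_difficulty(scores):
    """Item difficulty P: proportion of respondents answering each item correctly."""
    return np.asarray(scores, dtype=float).mean(axis=0)

def item_discrimination(scores, fraction=0.27):
    """Discrimination index D: proportion correct in the upper group minus
    proportion correct in the lower group, with groups defined as the top and
    bottom `fraction` of respondents ranked by total score."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    n_group = max(1, int(round(fraction * scores.shape[0])))
    # Stable sort so that ties are broken by row order (an assumption).
    order = np.argsort(totals, kind="stable")
    lower = scores[order[:n_group]]
    upper = scores[order[-n_group:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

# Hypothetical toy data: 10 respondents (rows) by 3 items (columns), 1 = correct.
toy_scores = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
])

p_values = item_difficulty(toy_scores)
d_values = item_discrimination(toy_scores)

# Flag items against the thresholds cited in the text (P in 0.3 to 0.8, D above 0.3).
acceptable = (p_values >= 0.3) & (p_values <= 0.8) & (d_values > 0.3)
print(p_values, d_values, acceptable)
```

With real data, each row would be one student's scored responses to the 14 items, after removing students who skipped more than two items.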
          The RCI-Pilot items that did not have acceptable difficulty or discrimination values were analyzed, and modifications were made to develop the items on the RCI-Final. Additionally, distractor analysis was performed to determine the effectiveness of each item's distractors. Modifications were made to specific distractors on RCI-Pilot items that were selected by less than 10% of the students. All analysis was conducted using Microsoft Excel.
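The distractor analysis can be sketched in the same spirit. The snippet below tabulates how often each answer choice was selected for one item and lists the distractors chosen by less than 10% of respondents. The function names, the four-option format, and the response counts are hypothetical and are not taken from the study data.

```python
from collections import Counter

def option_proportions(responses, options="ABCD"):
    """Proportion of respondents selecting each answer choice for one item."""
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts.get(opt, 0) / n for opt in options}

def underused_distractors(responses, key, options="ABCD", threshold=0.10):
    """Distractors (incorrect options) selected by fewer than `threshold` of respondents."""
    props = option_proportions(responses, options)
    return [opt for opt in options if opt != key and props[opt] < threshold]

# Hypothetical responses from 200 students to one item whose correct answer is A.
toy_responses = ["A"] * 120 + ["B"] * 50 + ["C"] * 18 + ["D"] * 12

proportions = option_proportions(toy_responses)
flagged = underused_distractors(toy_responses, key="A")
print(proportions, flagged)
```

A flagged distractor is only a candidate for revision; whether it is rewritten would still depend on the alternate conception it was meant to capture.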
         
        
          RQ 2. What evidence supports the validity of data obtained using the RCI-Pilot and the RCI-Final based on internal structure?
          This research question was answered by conducting Rasch analysis and DIF analysis. The Rasch model is a probabilistic model in which the probability of answering an item correctly increases with the respondent's ability relative to the difficulty of the item. One assumption in the Rasch model is unidimensionality, which means that the scale measures a single latent trait (Bond and Fox, 2015). Principal component analysis (PCA) of residuals was conducted to evaluate the assumption of unidimensionality. If the raw variance explained by measures is greater than 20%, then the data are considered consistent with unidimensionality (Bond and Fox, 2015). However, it is important to note that this residual analysis method cannot provide definitive proof of unidimensionality. Uncorrelated residual errors are a necessary condition for unidimensionality and may therefore suggest it, but they cannot definitively prove it (Rhemtulla et al., 2020). The item fit to the Rasch model was analyzed using fit statistics. Fit statistics can be used to gather further information about items, although they should not be used as definitive measures of validity (Bond and Fox, 2015). Two types of fit statistics, infit and outfit, were analyzed. The infit statistic is information-weighted and more sensitive to responses on items targeted near a respondent's ability level, so it is routinely considered before the outfit statistic (Bond and Fox, 2015). Within the infit and outfit statistics, the mean square statistic (MNSQ) shows the size of the randomness, and the standardized statistic (ZSTD) tests whether the data fit the model perfectly. The acceptable range for MNSQ infit and outfit is 0.70–1.30 for a low-stakes multiple-choice test, and if the items have MNSQ values within this range, then ZSTD can be ignored (Bond and Fox, 2015). All Rasch analysis was conducted using Winsteps 5.6.4 software (Linacre, 2023).
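To make the fit statistics concrete, the sketch below evaluates the dichotomous Rasch response probability and the infit and outfit mean squares for each item. It is not a substitute for Winsteps: it assumes that person abilities and item difficulties (in logits) are already known, whereas in practice they are estimated from the data, and the function names and simulated values are hypothetical.

```python
import numpy as np

def rasch_probability(theta, b):
    """P(correct) for every person (rows) and item (columns) under the
    dichotomous Rasch model: logistic function of ability minus difficulty."""
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit_mnsq(x, theta, b):
    """Return (infit, outfit) mean-square statistics for each item."""
    x = np.asarray(x, dtype=float)
    p = rasch_probability(theta, b)
    variance = p * (1.0 - p)          # model variance of each response
    sq_resid = (x - p) ** 2           # squared raw residuals
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = (sq_resid / variance).mean(axis=0)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sq_resid.sum(axis=0) / variance.sum(axis=0)
    return infit, outfit

# Simulate responses that follow the model, using hypothetical parameters.
rng = np.random.default_rng(42)
theta = rng.normal(0.0, 1.0, size=2000)        # person abilities (logits)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # item difficulties (logits)
x = (rng.random((2000, 5)) < rasch_probability(theta, b)).astype(int)

infit, outfit = item_fit_mnsq(x, theta, b)
in_range = (infit >= 0.70) & (infit <= 1.30) & (outfit >= 0.70) & (outfit <= 1.30)
print(infit, outfit, in_range)
```

Because the simulated data are generated from the model itself, both mean squares should fall close to 1; values well outside the 0.70–1.30 band on real data would mark an item for closer inspection.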
          DIF analysis was conducted based on gender. The word “gender” in this study refers to biological sex rather than the socially constructed associations of gender (Weisstein, 1968; Crawford and Marecek, 1989), which is consistent with DIF literature where gender is used to discuss groups differentiated by sex. If the participants did not report gender, then their responses were excluded from the DIF analysis. This exclusion resulted in a sample size of 464 (189 males and 275 females) for the RCI-Pilot data and a sample size of 539 (183 males and 356 females) for the RCI-Final data. For DIF analysis, the larger group is referred to as the reference group, and the smaller group is referred to as the focal group. The reference group was females, and the focal group was males for the DIF analysis conducted. For dichotomous data, two types of DIF exist, namely uniform and nonuniform. Uniform DIF suggests that the probability of answering an item correctly is consistently higher for one group than for the other at all ability levels, whereas nonuniform DIF suggests that the difference between the groups in the probability of answering an item correctly is not constant across ability levels (Walker, 2011).
          Three methods of DIF detection were conducted, namely the Mantel–Haenszel (MH) method (Mantel and Haenszel, 1959; Holland and Thayer, 1988), the logistic regression method (Swaminathan and Rogers, 1990), and the Rasch analysis method (Hambleton et al., 1991). It has been suggested that multiple methods should be used to detect both uniform and nonuniform DIF (Kendhammer and Murphy, 2014). These three methods provide good triangulation since they give different types of information. The MH method works well for detecting only uniform DIF. The logistic regression method works well for detecting both uniform and nonuniform DIF, and the Rasch analysis method provides more precise estimates of latent traits (Nedungadi et al., 2022). All DIF analyses were conducted using the R (R Core Team, 2023) packages difR (Magis et al., 2020) and ltm (Rizopoulos, 2022).
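The core of the MH method can be illustrated with a short sketch. Respondents are stratified on a matching variable (typically the total score), a 2 × 2 table of group by item outcome is formed in each stratum, and the tables are pooled into a common odds ratio. The study itself used the R package difR; the Python function below, its name, and the toy counts are hypothetical, and the sketch omits the MH chi-square test and any purification of the matching score.

```python
import math
import numpy as np

def mantel_haenszel_odds_ratio(item, group, matching):
    """Mantel-Haenszel common odds ratio for one dichotomous item.
    item: 1 = correct, 0 = incorrect; group: 0 = reference, 1 = focal;
    matching: stratifying variable such as the total score."""
    item = np.asarray(item)
    group = np.asarray(group)
    matching = np.asarray(matching)
    numerator = 0.0
    denominator = 0.0
    for stratum in np.unique(matching):
        m = matching == stratum
        a = np.sum((group[m] == 0) & (item[m] == 1))   # reference, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))   # reference, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))   # focal, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))   # focal, incorrect
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        return float("nan")   # odds ratio undefined for these data
    return numerator / denominator

# Hypothetical data: two score strata with 20 reference and 20 focal respondents each.
item = np.concatenate([
    np.ones(10), np.zeros(10),   # stratum 1, reference: 10 correct, 10 incorrect
    np.ones(5), np.zeros(15),    # stratum 1, focal: 5 correct, 15 incorrect
    np.ones(15), np.zeros(5),    # stratum 2, reference: 15 correct, 5 incorrect
    np.ones(10), np.zeros(10),   # stratum 2, focal: 10 correct, 10 incorrect
]).astype(int)
group = np.concatenate([np.zeros(20), np.ones(20), np.zeros(20), np.ones(20)]).astype(int)
matching = np.concatenate([np.full(40, 1), np.full(40, 2)])

alpha_mh = mantel_haenszel_odds_ratio(item, group, matching)
delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta scale
print(alpha_mh, delta_mh)
```

An odds ratio near 1 (delta near 0) indicates no uniform DIF; here the toy item favors the reference group in both strata, so the odds ratio is above 1.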
         
        
          RQ 3. What evidence supports the reliability of data obtained using the RCI-Pilot and the RCI-Final?
          This research question was answered by obtaining McDonald's omega values and Rasch reliability estimates of item separation and person separation. Values above 0.7 for McDonald's omega are considered acceptable (Crocker and Algina, 1986). Confidence intervals for the omega values were also calculated. This analysis was conducted using the R (R Core Team, 2023) packages psych (Revelle, 2024) and MBESS (Kelley and Lai, 2017).
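For a single-factor model, omega can be written as the squared sum of the factor loadings divided by that quantity plus the sum of the unique variances. The sketch below shows only this final arithmetic step and assumes the loadings have already been estimated; the study used the R packages psych and MBESS, which also estimate the factor model and the confidence intervals. The function name and the loadings are hypothetical.

```python
import numpy as np

def mcdonald_omega(loadings, uniquenesses):
    """Omega for a single-factor model: (sum of loadings)^2 divided by
    (sum of loadings)^2 plus the sum of the unique variances."""
    loadings = np.asarray(loadings, dtype=float)
    uniquenesses = np.asarray(uniquenesses, dtype=float)
    common = loadings.sum() ** 2
    return common / (common + uniquenesses.sum())

# Hypothetical standardized loadings for four items; with standardized items
# the unique variance of each item is 1 minus its squared loading.
toy_loadings = np.array([0.6, 0.5, 0.7, 0.4])
toy_uniquenesses = 1.0 - toy_loadings ** 2

omega = mcdonald_omega(toy_loadings, toy_uniquenesses)
print(round(omega, 3))
```

Because each item contributes its own loading, items that relate more weakly to the common construct lower omega without being forced to share a single loading value, which is the property that distinguishes omega from alpha.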
          Rasch item separation and person separation reliability values higher than 0.8 are acceptable (Bond and Fox, 2015). This threshold is easier to reach for item separation than for person separation, since person separation depends on the number of items on the assessment. The Rasch analysis was conducted using Winsteps 5.6.4 software (Linacre, 2023).
         
      
    
    
      Results and discussion
      
        Descriptive statistics
        The descriptive statistics for the data obtained from the pilot and final administrations of the RCI were computed using Jamovi 2.2.5 software (The Jamovi Project, 2024) and are given in Table 2. The results are comparable between the pilot and final administrations. Both sets of data are approximately normally distributed, as suggested by the Shapiro-Wilk W values of 0.97 for the RCI-Pilot and 0.98 for the RCI-Final. The mean score for the RCI-Pilot was 8.66 (95% confidence interval: 8.44 to 8.87). The mean score for the RCI-Final was 8.13 (95% confidence interval: 7.95 to 8.31).
        
Table 2 Descriptive statistics for RCI-Pilot and RCI-Final data
		
            
              
              
              
              
                
|  | RCI-Pilot administration | RCI-Final administration |
| N | 484 | 595 |
| Mean | 8.66 | 8.13 |
| Median | 9.00 | 8.00 |
| Standard deviation | 2.44 | 2.28 |
| Std. error mean | 0.11 | 0.09 |
| 95% CI mean lower bound | 8.44 | 7.95 |
| 95% CI mean upper bound | 8.87 | 8.31 |
| Skewness | −0.29 | −0.07 |
| Kurtosis | −0.43 | −0.43 |
| Shapiro-Wilk W | 0.97 | 0.98 |
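The standard errors and confidence bounds in Table 2 can be approximately recovered from the reported means, standard deviations, and sample sizes. The sketch below uses a normal approximation (z = 1.96), which is an assumption; Jamovi may use the t distribution, and the inputs are the rounded table values, so agreement is only expected to within about 0.01. The function name is hypothetical.

```python
import math

def mean_confidence_interval(mean, sd, n, z=1.96):
    """Standard error of the mean and an approximate 95% confidence interval."""
    se = sd / math.sqrt(n)
    return se, mean - z * se, mean + z * se

# Rounded values taken from Table 2.
se_pilot, lower_pilot, upper_pilot = mean_confidence_interval(8.66, 2.44, 484)
se_final, lower_final, upper_final = mean_confidence_interval(8.13, 2.28, 595)

print(round(se_pilot, 2), round(lower_pilot, 2), round(upper_pilot, 2))
print(round(se_final, 2), round(lower_final, 2), round(upper_final, 2))
```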
              
            
      
      
        Item quality
        The difficulty and discrimination values for the 14 items on the RCI-Pilot are depicted in Fig. 1. The items inside the box in the figure have acceptable difficulty and discrimination indices, so the items outside the box were analyzed further. Of the 14 RCI-Pilot items, six items were outside the acceptable range for difficulty. Four of these six items had difficulty values above 0.8, indicating that these were very easy items. Two of these four items, items 5 and 7, were from the same question category and were the only items on the RCI that involved curved arrow drawings. Thus, if these items were removed, the RCI would not be able to assess students’ understanding of an important aspect of resonance. The other two items, items 3 and 9, were retained since important information regarding student alternate conceptions could be gathered from these two items.
Fig. 1 Difficulty and discrimination indices for the 14 RCI-Pilot items.
All four items were modified based on information obtained from the distractor analysis. For example, item 7, which is depicted in Fig. 2, had distractors that were rarely selected. Answer choice (A), which is the correct answer, was selected by 89% of the students. The other three answer choices were under-utilized: only 3.5% selected (B), 4% selected (C), and 3.5% selected (D). These three distractors were modified using previously identified alternate conceptions to increase the effectiveness of the distractors (Tetschner and Nedungadi, 2023).
Fig. 2 Item 7 on the RCI-Pilot.
Two items, items 8 and 13, had difficulty values below 0.3, which indicated that these were difficult items. These two items were also the only items that had discrimination indices below the acceptable value of 0.3, which indicated that the items poorly discriminated between the high performing and low performing groups. Both items were designed to test students’ understanding of the relationship between resonance and stability. The two items were analyzed closely to determine if they were providing useful information regarding student alternate conceptions. Fig. 3 displays one of these difficult items, item 8.
Fig. 3 Item 8 on the RCI-Pilot.
Since the nitro substituent in the para position in anion 1 allows for more charge delocalization and, therefore, allows for the oxygen atom with the negative charge to be better stabilized, answer choice (A) is the correct answer for item 8. In a previous study, multiple organic chemistry instructors reported that item 8 appropriately covers resonance, is relevant to the concept of resonance, and has a correct proposed answer (Tetschner and Nedungadi, 2023). Despite this, students seemed to struggle with the item. The distractor analysis of item 8 indicated that all answer choices were being consistently selected by students; every answer choice was selected by more than 10% of the students. However, the most popular selection was a distractor rather than the correct answer: 51% of the students selected answer choice (D).
        Only 20% of the students selected the correct answer choice, (A). The other two distractors, (B) and (C), were chosen by 14% and 15% of the students, respectively. Since the difficulty of item 8 was high (P = 0.20), the discrimination index was low (D = 0.27), and many students were selecting the wrong answer choice, item 8 was flagged for potential removal at this stage of the analysis, but it was not modified. Further information was needed to determine if the item was a poor item that should be removed from the assessment or if the item could still provide useful information about student understanding. The 14-item RCI-Final was developed from the modifications that were made to the RCI-Pilot.
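The two CTT indices used above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the difficulty P is the proportion of correct responses, and the discrimination D is computed here as the difference in P between upper and lower scoring groups. The 27% group size is an assumption (it follows the convention associated with Kelley, 1939), and the data are hypothetical.

```python
def difficulty(item_scores):
    """Item difficulty P: proportion of respondents who answered correctly."""
    return sum(item_scores) / len(item_scores)


def discrimination(item_scores, total_scores, fraction=0.27):
    """Discrimination D: P in the top-scoring group minus P in the bottom group.

    `fraction` is the share of respondents placed in each group (assumed 27%).
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    p_high = sum(item_scores[i] for i in high) / k
    p_low = sum(item_scores[i] for i in low) / k
    return p_high - p_low


# Hypothetical data for ten respondents (NOT the study data).
total_scores = [2, 5, 9, 12, 13, 7, 3, 11, 6, 10]  # total test scores
item_scores = [0, 0, 1, 1, 1, 1, 0, 1, 0, 1]       # 0/1 scores on one item

P = difficulty(item_scores)
D = discrimination(item_scores, total_scores)
print(f"P = {P:.2f}, D = {D:.2f}")
```

Under the criteria used in this study, an item with P between 0.3 and 0.8 and D of at least 0.3 would fall inside the acceptable box in the difficulty and discrimination plots.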
        Difficulty and discrimination indices were determined for the items from the administration of the RCI-Final. These values are depicted in Fig. 4. Only two items, items 7 and 8, had difficulty values and discrimination indices outside the acceptable ranges, suggesting that the modifications made to the RCI-Pilot items improved the quality of most of the items. Item 7 was again found to be very easy, with a difficulty value of 0.93. Very easy items tend to be less effective at discriminating between high performing and low performing groups, so the discrimination index for item 7 (0.15) was, predictably, also outside the acceptable range. Although this item is very easy for students, it still provides valuable information. For example, the item suggests that students do not seem to have difficulty with drawing curved arrows when depicting resonance structures. Additionally, in the development of assessment instruments, it is useful to have items of varying difficulty to accurately assess student conceptual understanding (Bandalos, 2018). Thus, item 7 was retained.
        Fig. 4  Difficulty and discrimination indices for the 14 RCI-Final items.
Item 8 on the RCI-Final again had difficulty and discrimination indices outside the acceptable range with a difficulty index of 0.21 and a discrimination index of 0.15. This item was also retained based on the feedback obtained from organic chemistry faculty (Tetschner and Nedungadi, 2023) and based on further validity and reliability evidence that was obtained as discussed below.
        Notably, item 11, an item whose design and development have been reported in detail (Tetschner and Nedungadi, 2023), was within the acceptable ranges for difficulty and discrimination for both administrations. The modifications made to item 11 based on organic chemistry faculty feedback and the student responses collected (Tetschner and Nedungadi, 2023) appear to have produced an item of good quality in terms of the CTT parameters.
      
      
        Evidence for validity based on internal structure
        
          Rasch analysis. 
          To gather evidence for validity based on internal structure, Rasch analysis was conducted. The Rasch model assumes unidimensionality, so a principal component analysis (PCA) of residuals was conducted to determine whether the data collected from the RCI-Pilot and RCI-Final administrations are consistent with unidimensionality. The observed raw variance explained by measures was 33.1% for the RCI-Pilot items and 30.6% for the RCI-Final items. Since both values were above 20%, both versions of the RCI presumably measure a unidimensional construct (Bond and Fox, 2015). This indicates that the RCI-Pilot and the RCI-Final may be able to measure the latent trait of student understanding of resonance.
          The fit of the data to the Rasch model was determined by analyzing the fit statistics. Table 3 displays infit and outfit statistics for both the RCI-Pilot and the RCI-Final. All 14 items on the RCI-Pilot had mean square (MNSQ) infit values within the acceptable range of 0.7–1.3 (Bond and Fox, 2015), and all MNSQ outfit values fell within this range except that of item 7 (0.67), which was marginally below the lower bound. This indicated that the data obtained from the RCI-Pilot fit the Rasch model well. All items on the RCI-Final had infit statistics within the acceptable range. Only one item on the RCI-Final, item 8, had an MNSQ outfit value outside the range (1.45). However, item 8 had an acceptable MNSQ infit value (1.18), meaning that the item fit well with regard to typical or expected responses near the estimated ability level but did not fit as well across all ability groups.
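The distinction between infit and outfit can be made concrete with a short sketch. It uses the standard definitions of the two mean square statistics for a dichotomous Rasch model: outfit is the unweighted mean of the squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The person abilities and the item difficulty are hypothetical and are treated as known values here, whereas in practice they are estimated by the Rasch software.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit MNSQ, outfit MNSQ) for one dichotomous item."""
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        w = p * (1.0 - p)                  # model variance of the response
        sq_resid.append((x - p) ** 2)      # squared raw residual
        variances.append(w)
        z_sq.append((x - p) ** 2 / w)      # squared standardized residual
    infit = sum(sq_resid) / sum(variances)  # information-weighted
    outfit = sum(z_sq) / len(z_sq)          # unweighted, outlier-sensitive
    return infit, outfit


# Hypothetical abilities (logits) and an item of difficulty 0 logits.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Orderly pattern: low-ability persons miss the item, high-ability persons get it.
infit_ok, outfit_ok = item_fit([0, 0, 1, 1, 1], thetas, b=0.0)

# Surprising pattern: the lowest-ability person succeeds, the highest fails.
infit_bad, outfit_bad = item_fit([1, 0, 1, 1, 0], thetas, b=0.0)

print(f"orderly:    infit = {infit_ok:.2f}, outfit = {outfit_ok:.2f}")
print(f"surprising: infit = {infit_bad:.2f}, outfit = {outfit_bad:.2f}")
```

Because unexpected responses from persons far from the item's difficulty are not down-weighted in the outfit statistic, such responses inflate outfit more than infit. This is the pattern reported for item 8 on the RCI-Final.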
          
Table 3 Infit and outfit statistics for the RCI-Pilot and RCI-Final data

| Item | Infit MNSQ (a), RCI-Pilot | Infit ZSTD (b), RCI-Pilot | Infit MNSQ, RCI-Final | Infit ZSTD, RCI-Final | Outfit MNSQ, RCI-Pilot | Outfit ZSTD, RCI-Pilot | Outfit MNSQ, RCI-Final | Outfit ZSTD, RCI-Final |
| 1 | 0.96 | −0.76 | 0.95 | −1.42 | 0.91 | −1.16 | 0.93 | −1.51 |
| 2 | 1.13 | 3.06 | 1.04 | 1.35 | 1.18 | 2.56 | 1.08 | 1.64 |
| 3 | 1.01 | 0.20 | 0.98 | −0.29 | 0.94 | −0.37 | 0.91 | −0.91 |
| 4 | 1.06 | 1.32 | 1.01 | 0.33 | 1.18 | 2.38 | 1.13 | 1.82 |
| 5 | 0.95 | −0.66 | 0.95 | −1.15 | 0.83 | −1.39 | 0.88 | −1.64 |
| 6 | 1.00 | −0.02 | 0.96 | −0.78 | 0.89 | −0.83 | 0.90 | −1.04 |
| 7 | 0.92 | −0.81 | 0.97 | −0.21 | 0.67 | −1.81 | 0.86 | −0.64 |
| 8 | 1.08 | 1.12 | 1.18 | 2.80 | 1.29 | 2.14 | 1.45 | 3.96 |
| 9 | 0.95 | −0.73 | 0.98 | −0.35 | 0.84 | −1.26 | 0.94 | −0.65 |
| 10 | 0.91 | −2.17 | 0.96 | −1.24 | 0.91 | −1.22 | 0.94 | −1.26 |
| 11 | 0.87 | −3.55 | 1.00 | 0.15 | 0.81 | −3.64 | 1.00 | 0.09 |
| 12 | 1.03 | 0.44 | 0.91 | −1.50 | 1.19 | 1.51 | 0.87 | −1.38 |
| 13 | 1.12 | 2.06 | 1.05 | 1.20 | 1.21 | 2.05 | 1.10 | 1.41 |
| 14 | 0.99 | −0.26 | 1.01 | 0.45 | 0.99 | −0.13 | 1.05 | 1.15 |

(a) MNSQ: mean square statistic. (b) ZSTD: standardized statistic.

          The Wright map, which plots item difficulty and person ability on the same logit scale, is shown for the RCI-Final data in Fig. 5. The Wright map for the RCI-Pilot is given in the ESI.† This type of plot displays individual test takers (as ability groups) alongside items that are arranged vertically by difficulty. The Wright map shows that the difficulty of the items on the RCI-Final ranges from −2.4 to +2.08 logits, and the overall distribution of the items indicates that there is a good spread of items based on difficulty when compared to person ability. Item 8 is the most difficult item, with a Rasch difficulty value of +2.08, but there is an ability group (of approximately 28–35 people) above this item. The position of item 8 on the Wright map and its acceptable MNSQ infit value suggest that the item fits the Rasch model relatively well and that the item is not a significant outlier. Item 7 is the easiest item, with a Rasch difficulty value of −2.4, and all ability levels are higher than this item. These results are consistent with the CTT results for this item, further indicating that this item is very easy.
          Fig. 5  Wright map for RCI-Final items. The measure is based on the JMLE value (item difficulty) of each item. “#” indicates 9 individual test takers, and “.” indicates 1 to 8 individual test takers.
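Because persons and items share one logit scale on a Wright map, the vertical distance between a person and an item translates directly into a probability of success under the dichotomous Rasch model. The sketch below is illustrative only: it uses the two extreme item difficulties reported above (−2.4 for item 7 and +2.08 for item 8), while the person abilities are hypothetical values chosen for the example.

```python
import math


def p_correct(theta, b):
    """Dichotomous Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Rasch difficulties (logits) reported for the RCI-Final.
item_difficulty = {"item 7": -2.4, "item 8": 2.08}

# Hypothetical person abilities (logits), chosen only for illustration.
for theta in (-1.0, 0.0, 1.0, 2.08):
    row = ", ".join(
        f"{name}: {p_correct(theta, b):.2f}" for name, b in item_difficulty.items()
    )
    print(f"theta = {theta:+.2f} -> {row}")
```

A person located exactly at an item's difficulty has a 50% chance of answering it correctly, so persons plotted above item 8 on the map are more likely than not to answer it correctly.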
Since the PCA of residuals for the RCI-Pilot and RCI-Final items suggests that the instrument is likely measuring a unidimensional construct, and since the item fit statistics for both iterations indicated that the items fit the Rasch model well, the instrument appears to measure what it intends to measure: students’ conceptual knowledge of resonance. Additionally, the test scores appear to have a consistent meaning for all test takers. Together, these results provide evidence for validity based on internal structure.
         
        
          Differential item functioning (DIF). 
          Three methods of DIF detection were used to gather further evidence for internal structure validity. Both the RCI-Pilot items and the RCI-Final items were analyzed for gender-based DIF. The first method utilized was the Mantel–Haenszel (MH) method for testing uniform DIF. The MH plots for the RCI-Pilot and the RCI-Final data are shown in Fig. 6. Uniform DIF was not detected in either the RCI-Pilot or the RCI-Final data. All items are below the threshold chi-squared statistic line on the plot, meaning that none of the RCI-Pilot items or RCI-Final items showed significant DIF based on gender according to the MH method.
          Fig. 6  (a) The MH plot for gender-based DIF for the RCI-Pilot items. (b) The MH plot for gender-based DIF for the RCI-Final items. Note: the numbers within the plot represent items, and the horizontal line marks the threshold chi-squared statistic value for DIF.
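The MH statistic plotted for each item can be sketched as follows. This is a minimal illustration of the standard continuity-corrected Mantel–Haenszel chi-squared statistic, with respondents matched on total score; it is not the software used in the study. The data are hypothetical, the helper `build` exists only to construct them, and the threshold of 3.84 (1 degree of freedom, alpha = 0.05) is an assumption about where the horizontal line sits.

```python
def mantel_haenszel_chi2(item, group, total):
    """Mantel-Haenszel chi-squared statistic (1 df, continuity-corrected).

    item  : 0/1 scores on the studied item
    group : "ref" or "focal" for each respondent
    total : matching variable (here, the total test score)
    """
    strata = {}
    for x, g, t in zip(item, group, total):
        cell = strata.setdefault(t, {"ref": [0, 0], "focal": [0, 0]})
        cell[g][0 if x == 1 else 1] += 1  # [n correct, n incorrect]

    sum_a = sum_e = sum_v = 0.0
    for cell in strata.values():
        a, b = cell["ref"]
        c, d = cell["focal"]
        n = a + b + c + d
        if a + b == 0 or c + d == 0 or n < 2:
            continue  # stratum carries no between-group information
        sum_a += a                                   # observed ref-group correct
        sum_e += (a + b) * (a + c) / n               # expected under no DIF
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if sum_v == 0:
        return 0.0
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v


def build(cells):
    """Expand (total, group, n_correct, n_incorrect) tuples into respondent lists."""
    item, group, total = [], [], []
    for t, g, n_right, n_wrong in cells:
        item += [1] * n_right + [0] * n_wrong
        group += [g] * (n_right + n_wrong)
        total += [t] * (n_right + n_wrong)
    return item, group, total


CRITICAL = 3.84  # assumed threshold: chi-squared critical value, 1 df, alpha = 0.05

# Hypothetical item on which matched groups perform identically.
no_dif = mantel_haenszel_chi2(*build([(5, "ref", 5, 5), (5, "focal", 5, 5),
                                      (9, "ref", 8, 2), (9, "focal", 8, 2)]))
# Hypothetical item that favors the reference group at every score level.
dif = mantel_haenszel_chi2(*build([(5, "ref", 9, 1), (5, "focal", 2, 8),
                                   (9, "ref", 10, 0), (9, "focal", 5, 5)]))

print(f"no-DIF item: {no_dif:.2f}, flagged: {no_dif > CRITICAL}")
print(f"DIF item:    {dif:.2f}, flagged: {dif > CRITICAL}")
```

An item is flagged for uniform DIF only when its statistic exceeds the threshold, which corresponds to a point above the horizontal line in the MH plots.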
The results from the MH method were compared to the results obtained from the other two methods: logistic regression and Rasch analysis. The results from the logistic regression confirmed that none of the items on the RCI-Pilot and none of the items on the RCI-Final exhibit uniform DIF. The logistic regression method flagged two RCI-Pilot items, 1 and 4, for exhibiting nonuniform DIF. The presence of nonuniform DIF is usually confirmed by analyzing the item characteristic curves (ICCs) for the items. The ICCs for items 1 and 4 are shown in Fig. 7.
          Fig. 7  ICCs for RCI-Pilot items 1 and 4. Note: the reference group was females, and the focal group was males. The y-axis measures the likelihood (the probability) of the group to answer the item correctly, and the x-axis represents the latent trait or ability level (from low to high).
From the ICCs for both items 1 and 4, there is not enough evidence to suggest that nonuniform DIF exists in these items. In item 1, the reference group (females) outperforms the focal group (males) at most ability levels, but this trend reverses slightly at higher ability levels. In item 4, the focal group (males) outperforms the reference group (females) at all ability levels except at very low ability levels. The difference in performance, as seen in the ICCs, is minimal, and therefore, there is not enough evidence for nonuniform DIF.
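The difference between uniform and nonuniform DIF can be illustrated with the logistic regression formulation of Swaminathan and Rogers (1990), in which the log-odds of a correct response are modeled as b0 + b1·θ + b2·G + b3·θ·G, where G codes group membership. A nonzero b2 with b3 = 0 shifts one curve relative to the other (uniform DIF), whereas a nonzero b3 makes the curves cross (nonuniform DIF). The coefficients in this sketch are hypothetical and are not estimates from the RCI data; the coding G = 0 for the reference group and G = 1 for the focal group is also an assumption.

```python
import math


def p_correct(theta, group, b0, b1, b2, b3):
    """Probability of a correct response under the logistic regression DIF model.

    group: 0 = reference group, 1 = focal group (assumed coding).
    """
    z = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-z))


def crossing_point(b2, b3):
    """Ability level at which the two group curves intersect (None if parallel)."""
    return None if b3 == 0 else -b2 / b3


# Hypothetical coefficients with a small interaction term (nonuniform DIF).
coef = dict(b0=0.0, b1=1.0, b2=0.2, b3=-0.2)
theta_cross = crossing_point(coef["b2"], coef["b3"])

for theta in (-1.0, theta_cross, 2.0):
    p_ref = p_correct(theta, 0, **coef)
    p_focal = p_correct(theta, 1, **coef)
    print(f"theta = {theta:+.1f}: reference = {p_ref:.3f}, focal = {p_focal:.3f}")
```

When the gap between the two curves stays small on both sides of the crossing point, as in this example, the visual evidence for nonuniform DIF is weak even if the interaction term is statistically flagged.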
          Similarly, the logistic regression method flagged two RCI-Final items, 4 and 13, as exhibiting nonuniform DIF. However, analysis of the ICCs shows that there is not enough evidence to support nonuniform DIF. The ICCs for these two RCI-Final items are shown in Fig. 8.
          Fig. 8  ICCs for RCI-Final items 4 and 13. Note: the reference group was females, and the focal group was males. The y-axis measures the likelihood (the probability) of the group to answer the item correctly, and the x-axis represents the latent trait or ability level (from low to high).
The results from the MH method and the logistic regression method were checked against a Rasch analysis for detecting DIF. Only one item on the RCI-Pilot, item 2, showed uniform DIF by Rasch analysis. No items on the RCI-Final showed uniform DIF through Rasch analysis. Overall, the results from these three DIF detection methods provide strong evidence for the internal structure validity of the RCI-Final. No RCI-Final items showed uniform DIF, and there was not enough evidence to support nonuniform DIF. This suggests that the RCI-Final is a fair assessment that does not favor one gender subgroup over the other.
         
      
      
        Evidence for reliability
        Obtaining reliability evidence is situation specific (American Educational Research Association et al., 2014). Reliability evidence for the RCI-Pilot and the RCI-Final data was obtained using McDonald's omega. It has been argued that internal consistency measures like Cronbach's alpha, which is commonly used, may not be suitable reliability measures for concept inventories (Adams and Wieman, 2011). Other studies in which concept inventories have been developed have reported low internal consistency values (Bretz and Linenberger, 2012; McClary and Bretz, 2012). These studies argue that concept inventories measure student conceptual understanding, which is usually not coherent, so measures like Cronbach's alpha, which are designed to detect a highly connected structure, are expected to be low. An alternative to Cronbach's alpha is McDonald's omega, which is conceptually similar to alpha but allows the common construct to affect the true value of each item to a different extent (Crocker and Algina, 1986). The RCI-Pilot data have an internal consistency of 0.62 as measured by the omega coefficient, with a 95% confidence interval of [0.55, 0.65]. The RCI-Final data have an internal consistency of 0.54 as measured by the omega coefficient, with a 95% confidence interval of [0.43, 0.55]. It has been argued that McDonald's omega is a better estimate of single-administration reliability than Cronbach's alpha, but there is still debate within the chemical education research community as to which indicators of reliability are most appropriate for concept inventories. We acknowledge the low internal consistency values for the RCI-Pilot and the RCI-Final data, so users of the RCI should interpret their results with this in mind.
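For a single-factor model, McDonald's omega can be written as the squared sum of the factor loadings divided by that same quantity plus the sum of the item error variances. The sketch below applies this formula to hypothetical standardized loadings; it is not the estimation route used in the study, and the loadings are not RCI estimates. With standardized loadings, each error variance is assumed to equal one minus the squared loading.

```python
def mcdonald_omega(loadings, error_variances=None):
    """McDonald's omega for a single-factor model.

    loadings        : factor loadings of the items on the common factor
    error_variances : item error (unique) variances; if omitted, loadings are
                      assumed standardized, so each error variance is 1 - loading**2
    """
    if error_variances is None:
        error_variances = [1.0 - l * l for l in loadings]
    common = sum(loadings) ** 2
    return common / (common + sum(error_variances))


# Hypothetical standardized loadings for a 14-item instrument (NOT RCI estimates).
equal_loadings = [0.3] * 14
varied_loadings = [0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3,
                   0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.1]

omega_equal = mcdonald_omega(equal_loadings)
omega_varied = mcdonald_omega(varied_loadings)
print(f"omega (equal loadings):  {omega_equal:.2f}")
print(f"omega (varied loadings): {omega_varied:.2f}")
```

Unlike alpha, this formula does not require the loadings to be equal, which is why omega is often preferred when items relate to the construct to different degrees. Modest loadings of the size used here give omega values in the same general region as those reported for the RCI.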
        Reliability at the item level was obtained using Rasch analysis through item separation and person separation reliability for both sets of data. The item separation reliability for both the RCI-Pilot and the RCI-Final data was found to be 0.99, which is well above the acceptable value of 0.8, indicating that there is a good spread of items for different person abilities. The person separation reliabilities for the RCI-Pilot and RCI-Final data were 0.60 and 0.53, respectively, which are below the acceptable value of 0.8. A high person separation reliability is more difficult to achieve given that it is item dependent (Bond and Fox, 2015). Again, we acknowledge the low person separation reliability, so users of the RCI should interpret the results with this in mind.
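Separation reliability in Rasch analysis is the share of the observed variance in the measures that is not attributable to measurement error. The sketch below computes it from a set of measures and their standard errors. All values are hypothetical, and the use of the population variance is an assumption that may differ in detail from a given software package.

```python
from statistics import mean, pvariance


def separation_reliability(measures, std_errors):
    """Separation reliability: (observed variance - error variance) / observed variance."""
    observed_var = pvariance(measures)
    error_var = mean([se * se for se in std_errors])  # mean square error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical item measures: widely spread, with small standard errors because
# each item is answered by hundreds of respondents.
item_rel = separation_reliability(
    measures=[-2.4, -1.0, 0.0, 1.0, 2.08],
    std_errors=[0.1, 0.1, 0.1, 0.1, 0.1],
)

# Hypothetical person measures: narrower spread, with large standard errors
# because each person answers only a small number of items.
person_rel = separation_reliability(
    measures=[-1.0, -0.5, 0.0, 0.5, 1.0],
    std_errors=[0.6, 0.6, 0.6, 0.6, 0.6],
)

print(f"item separation reliability:   {item_rel:.2f}")
print(f"person separation reliability: {person_rel:.2f}")
```

The contrast shows why person separation reliability depends on the items: with a short instrument, each person measure carries a large standard error, which lowers the ratio regardless of how well the items themselves are separated.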
      
    
    
      Conclusions
      This study was aimed at obtaining evidence for the quality of the RCI items, validity of the data obtained using the RCI based on internal structure, and reliability of the data obtained with the RCI. Evidence for the quality of the RCI items was supported by obtaining the difficulty and discrimination indices of the RCI-Pilot and RCI-Final items. Modifications were made to the items on the RCI-Pilot based on this information to develop the 14-item RCI-Final. Most of the items on the RCI-Final are within the acceptable ranges of difficulty and discrimination except for two items. Important information regarding student alternate conceptions was still obtained using these two items, so there was evidence to retain the items.
      Validity evidence based on internal structure was obtained using Rasch and DIF analyses. Both the RCI-Pilot and the RCI-Final data fit the Rasch model as indicated by fit statistics. The Wright maps also show a good spread of items based on difficulty with respect to student ability. The results suggest that there is good evidence for validity based on internal structure and that the RCI is measuring the intended construct of student conceptual understanding of resonance. The validity evidence based on internal structure was further supported by the DIF analysis results. No RCI-Final items exhibited uniform DIF, and there was insufficient evidence for nonuniform DIF, suggesting that the instrument is a fair assessment without bias toward either gender subgroup.
      Reliability evidence for the data obtained using the RCI was obtained using McDonald's omega. The omega values for the RCI-Pilot and RCI-Final data are low, and users of the assessment should be aware of this when interpreting test scores. Further evidence for reliability of the data might need to be obtained.
      
        Implications for teaching
        Given the importance of the concept of resonance in undergraduate organic chemistry and given the difficulties that students face with the concept, an instrument like the RCI would be very beneficial to detect commonly held alternate conceptions on resonance. The RCI could help with the large-scale assessment of student understanding leading to instructional modifications. The instrument can be administered immediately after the instruction of resonance as a diagnostic assessment. Instructors can then address commonly held alternate conceptions by providing additional instruction or more practice problems. Instructors can also use the information to modify the way they teach resonance to prevent students from developing these alternate conceptions. For example, instructors could make sure to point out the difference between resonance and equilibrium and consistently remind their students about this distinction. The information obtained using the RCI could also help bring about changes in the way resonance is introduced to students in general chemistry.
      
      
        Interpretation and use statement
        The resonance concept inventory (RCI)‡ measures organic chemistry students’ understanding of the concept of resonance and could assist with identifying alternate conceptions that they develop. The concept of resonance is crucial for obtaining a thorough understanding of fundamental and advanced organic chemistry content. The RCI would help organic chemistry instructors quickly identify problems that students have with understanding resonance leading to changes in their approach to teaching resonance. The RCI is a 14-item multiple-choice assessment that needs to be administered during a regular class period. There is no strict time limit, but students take approximately 20 minutes to complete the assessment. It is recommended that the RCI be used within the first month of a first-semester undergraduate organic chemistry class after the concept of resonance has been covered in class. The students can record their answers on scantrons or on bubble sheets. Dichotomous scoring is utilized where students are given 1 for every correct answer and 0 for every incorrect answer. The sum of the number correct can be used by the instructor as an interpretation of a student's overall understanding of resonance. An individual analysis of scores on each item could provide the instructor with richer information on student understanding of specific sub-topics within resonance. The RCI could be utilized as a good diagnostic tool to get better information on student understanding of resonance. The raw scores obtained from the RCI approximately correspond to the latent ability measures from the Rasch analysis as shown in Fig. 9. This indicates that the raw scores have good predictive power when obtaining information regarding student understanding of resonance.
        Fig. 9  Relationship between raw scores and Rasch ability estimates for the RCI-Final data.
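The dichotomous scoring described in the interpretation and use statement can be sketched as follows. The answer key and the student responses here are hypothetical placeholders and do not reflect the actual RCI key; the sketch only shows how raw scores and per-item proportions correct could be tallied.

```python
def score(responses, key):
    """Dichotomous scoring: 1 for a correct answer, 0 otherwise (blanks score 0)."""
    return [1 if r == k else 0 for r, k in zip(responses, key)]


KEY = "ABCDABCDABCDAB"  # hypothetical 14-item answer key, NOT the real RCI key

# Hypothetical student responses, one character per item.
class_responses = {
    "student_1": "ABCDABCDABCDAB",
    "student_2": "AACDABCAABCDAA",
}

item_scores = {name: score(resp, KEY) for name, resp in class_responses.items()}

# Raw score: number of items answered correctly (0 to 14).
raw_scores = {name: sum(s) for name, s in item_scores.items()}

# Per-item proportion correct, for an item-by-item look at the class.
n_students = len(item_scores)
proportion_correct = [
    sum(s[i] for s in item_scores.values()) / n_students for i in range(len(KEY))
]

print(raw_scores)
print(proportion_correct)
```

The raw scores support an overall interpretation of each student's understanding, while the per-item proportions point to the specific sub-topics on which the class struggled.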
Implications for research
        Assessment development is an iterative process, and validity and reliability evidence should be obtained at various stages of development. It is also important to report multiple forms of evidence for validity and reliability so that users of the instrument are well equipped to draw meaningful conclusions from the data. These different forms of validity and reliability evidence also help with the modification of items during instrument development. Rasch analysis is a good method for building a truly linear scale and obtaining internal structure validity. DIF analysis should become a routine part of concept inventory development given the major implications it has towards developing equitable instruments. The results from this study give important information regarding the evidence for validity based on internal structure and evidence for reliability of data obtained using the RCI.
      
      
        Limitations
        The pilot data was obtained from three mid-sized universities, and the final data was obtained from four mid-sized universities. More data could be collected with the RCI-Final from other universities leading to more generalizable results. Additionally, this could help improve the reliability of the data obtained using the RCI.
        DIF was only analyzed based on gender even though additional demographic information was collected from the respondents. Further validity evidence could be gathered to support the interpretation of the assessment scores with the investigation of DIF based on year of study and DIF based on major.
        Certain assumptions were made during the data analysis stage of the study that may limit the conclusions that can be drawn from the RCI. If a student left three or more items blank when taking the assessment, then the student's responses to all the items were removed from the analysis. The assumption was made that these specific students were not fully engaged, so their responses were removed to avoid labeling “non” answers as “incorrect” or “correct.” However, some of these students may have been engaged with the assessment, and bias may therefore have been introduced by this selective removal of data. Specifically, the results of the DIF analysis may have been influenced by the removal of this data if respondents from a certain gender subgroup were more likely to leave three or more items blank. Only one data point was removed from the RCI-Pilot data for this reason, but the results may still have been affected.
        A PCA of residuals was conducted to determine whether the data suggest that the assessment reflects unidimensionality, but a PCA of residuals cannot definitively prove that an assessment is unidimensional. Thus, the conclusions drawn from the Rasch analysis can only be made on the assumption that the assessment is, indeed, unidimensional.
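The exclusion rule described above, and the check on whether it removes one subgroup more often than another, can be sketched as follows. The records, the group labels, and the function name `split_by_blanks` are hypothetical; only the rule itself (removal when three or more items are left blank) comes from the text.

```python
from collections import Counter


def split_by_blanks(records, max_blanks=2):
    """Separate respondents to keep from those removed for leaving too many blanks.

    records: list of (group_label, responses); a blank is None or "".
    Respondents with more than `max_blanks` blanks (three or more) are removed.
    """
    kept, removed = [], []
    for group, responses in records:
        blanks = sum(1 for r in responses if r in (None, ""))
        (kept if blanks <= max_blanks else removed).append((group, responses))
    return kept, removed


# Hypothetical records (NOT the study data).
records = [
    ("F", list("ABCDABCDABCDAB")),                # no blanks: kept
    ("M", list("ABCDABCDABC") + ["", "", ""]),    # three blanks: removed
    ("F", list("ABCDABCDABCD") + ["", ""]),       # two blanks: kept
]

kept, removed = split_by_blanks(records)

# Tally removals by subgroup to see whether the rule affects groups unevenly.
removed_by_group = Counter(group for group, _ in removed)
print(len(kept), "kept;", dict(removed_by_group), "removed")
```

If removals were concentrated in one subgroup, the DIF results would warrant a sensitivity check with those respondents included.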
      
    
    
      Author contributions
      SN conceptualized the study and led the administration of the project. GCT collected and analysed the data with assistance from SN. GCT wrote the original draft of the manuscript and SN reviewed and edited it. Both authors approved the submitted version of the manuscript.
    
    
      Data availability
      Deidentified data collected from human participants, described in all tables and figures, are available as an electronic supplement to the paper. The citations for all software packages used in this study have been provided in the references section of the article.
    
    
      Conflicts of interest
      We have no conflicts of interest to declare.
    
  
    Acknowledgements
      We would like to thank all the students who took part in this study.
    
    Notes and references
      - Abell T. and Bretz S., (2019), Development of the Enthalpy and Entropy in Dissolution and Precipitation Inventory, J. Chem. Educ., 96(9), 1804–1812 DOI:10.1021/acs.jchemed.9b00186.
- Adams W. K. and Wieman C. E., (2011), Development and Validation of Instruments to Measure Learning of Expert-Like Thinking, Int. Jou. Sci. Educ., 33(9), 1289–1312 DOI:10.1080/09500693.2010.512369.
- American Educational Research Association, American Psychological Association and National Council on Measurement in Education, (2014), Standards for Educational and Psychological Testing, American Educational Research Association.
- Atkinson M., Popova M., Croisant M., Reed D. and Bretz S., (2020), Development of the Reaction Coordinate Diagram Inventory: Measuring Student Thinking and Confidence, J. Chem. Educ., 97(7), 1841–1851 DOI:10.1021/acs.jchemed.9b01186.
- Bandalos D. L., (2018), Measurement theory and applications for the social sciences, p. xxiii, 661.
- Barbera J., Naibert N., Komperda R. and Pentecost T. C., (2021), Clarity on Cronbach's Alpha Use, J. Chem. Educ., 98(2), 257–258 DOI:10.1021/acs.jchemed.0c00183.
- Betancourt-Perez R., Olivera L. and Rodriguez J., (2010), Assessment of Organic Chemistry Students’ Knowledge of Resonance-Related Structures, J. Chem. Educ., 87(5), 547–551 DOI:10.1021/ed800163g.
- Bond T. and Fox C., (2015), Applying the Rasch Model: Fundamental Measurement in the Human Sciences, Routledge.
- Brandfonbrener P., Watts F. and Shultz G., (2021), Organic Chemistry Students’ Written Descriptions and Explanations of Resonance and Its Influence on Reactivity, J. Chem. Educ., 98(11), 3431–3441 DOI:10.1021/acs.jchemed.1c00660.
- Brandriet A. and Bretz S., (2014), The Development of the Redox Concept Inventory as a Measure of Students’ Symbolic and Particulate Redox Understandings and Confidence, J. Chem. Educ., 91(8), 1132–1144 DOI:10.1021/ed500051n.
- Bretz S. L. and Linenberger K. J., (2012), Development of the enzyme–substrate interactions concept inventory, Biochem. Mol. Bio. Educ., 40(4), 229–233 DOI:10.1002/bmb.20622.
- Bretz S. L. and Murato Mayo A. V., (2018), Development of the flame test concept inventory: measuring student thinking about atomic emission, J. Chem. Educ., 95(1), 17–27 DOI:10.1021/acs.jchemed.7b00594.
- Brown C. E., Hyslop R. M. and Barbera J., (2015), Development and analysis of an instrument to assess student understanding of GOB chemistry knowledge relevant to clinical nursing practice, Biochem. Mol. Bio. Educ., 43(1), 13–19 DOI:10.1002/bmb.20834.
- Clauser B. E. and Mazor K. M., (1998), Using Statistical Procedures to Identify Differentially Functioning Test Items, Educ. Meas.: Issues Pract., 17(1), 31–44 DOI:10.1111/j.1745-3992.1998.tb00619.x.
- Cole N. S., (1997), The ETS gender study: How females and males perform in educational settings.
- Cook D. A. and Beckman T. J., (2006), Current Concepts in Validity and Reliability for Psychometric Instruments: Theory and Application, Am. Jou. Med., 119(2), 166.e7–166.e16 DOI:10.1016/j.amjmed.2005.10.036.
- Crawford M. and Marecek J., (1989), Feminist theory, feminist psychology: a bibliography of epistemology, critical analysis, and applications, Psychol. Women Q., 13(4), 477–491 DOI:10.1111/j.1471-6402.1989.tb01015.x.
- Crocker L. and Algina J., (1986), Introduction to Classical and Modern Test Theory, Harcourt.
- Dick-Perez M., Luxford C., Windus T. and Holme T., (2016), A Quantum Chemistry Concept Inventory for Physical Chemistry Classes, J. Chem. Educ., 93(4), 605–612 DOI:10.1021/acs.jchemed.5b00781.
- Doran R. L., (1980), Basic measurement and evaluation of science instruction, National Science Teachers Association.
- Duis J., (2011), Organic Chemistry Educators’ Perspectives on Fundamental Concepts and Misconceptions: An Exploratory Study, J. Chem. Educ., 88(3), 346–350 DOI:10.1021/ed1007266.
- Halpern D. F., (1997), Sex differences in intelligence: implications for education, Am. Psych., 52(10), 1091.
- Hambleton R. K., Swaminathan H. and Rogers H. J., (1991), Fundamentals of item response theory, Sage.
- Holland P. W. and Thayer D. T., (1988), Differential item performance and the Mantel–Haenszel procedure, Test Validity, 129–145 DOI:10.1037/14047-004.
- Kane M. T., (2013), Validating the Interpretations and Uses of Test Scores, J. Educ. Meas., 50(1), 1–73 DOI:10.1111/jedm.12000.
- Kaplan R. M. and Saccuzzo D. P., (1997), Psychological testing: Principles, applications, and issues, 4th edn, Thomson Brooks: Cole Publishing Co.
- Kelley T. L., (1939), The selection of upper and lower groups for the validation of test items, J. Educ. Psych., 30(1), 17–24 DOI:10.1037/h0057123.
- Kelley K. and Lai K., (2017), The MBESS R Package version 4.2.0, Retrieved from https://cran.r-project.org/web/packages/MBESS/MBESS.pdf.
- Kendhammer L. K. and Murphy K. L., (2014), Innovative uses of assessments for teaching and research, Innovative Uses of Assessments for Teaching and Research, ACS Publications, pp. 1–4.
- Kim T., Wright L. K. and Miller K., (2019), An examination of students’ perceptions of the Kekulé resonance representation using a perceptual learning theory lens, Chem. Educ. Res. Pract., 20(4), 659–666 DOI:10.1039/C9RP00009G.
- Komperda R., Pentecost T. C. and Barbera J., (2018), Moving beyond Alpha: A Primer on Alternative Sources of Single-Administration Reliability Evidence for Quantitative Chemistry Education Research, J. Chem. Educ., 95(9), 1477–1491 DOI:10.1021/acs.jchemed.8b00220.
- Leontyev A., (2015), Development of a stereochemistry concept inventory. PhD dissertation, University of Northern Colorado, Greeley, CO, http://digscholarship.unco.edu/dissertations/33.
- Libarkin J., (2008), Concept inventories in higher education science, BOSE Conf., pp. 1–10.
- Linacre J. M., (2023), Winsteps (Version 5.6.4) [Computer Software], Portland, Oregon.
- Lindell R. S., Peak E. and Foster T. M., (2007), Are They All Created Equal? A Comparison of Different Concept Inventory Development Methodologies, AIP Conf. Proc., 883(1), 14–17 DOI:10.1063/1.2508680.
- Maccoby E. E. and Jacklin C. N., (1974), The psychology of sex differences, p. xiii, 634.
- Magis D., Beland S. and Raiche G., (2020), difR: collection of methods to detect dichotomous differential item functioning (DIF) in psychometrics, R package version 4.7.
- Mantel N. and Haenszel W., (1959), Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst., 22(4), 719–748.
- Martinková P., Drabinová A., Liaw Y.-L., Sanders E. A., McFarland J. L. and Price R. M., (2017), Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments, LSE, 16(2), rm2 DOI:10.1187/cbe.16-10-0307.
- McClary L. and Bretz S., (2012), Development and Assessment of A Diagnostic Tool to Identify Organic Chemistry Students’ Alternative Conceptions Related to Acid Strength, Int. J. Sci. Educ., 34(15), 2317–2341 DOI:10.1080/09500693.2012.684433.
- Mulford D. R. and Robinson W. R., (2002), An Inventory for Alternate Conceptions among First-Semester General Chemistry Students, J. Chem. Educ., 79(6), 739 DOI:10.1021/ed079p739.
- National Science Foundation, (2003), Program for Gender Diversity in Science, Technology, Engineering, and Mathematics Education, https://www.nsf.gov/pubs/2003/nsf03502/nsf03502.htm.
- Nedungadi S. and Brown C. E., (2021), Thinking like an electron: concepts pertinent to developing proficiency in organic reaction mechanisms, Chem. Teach. Int., 3(1), 9–17 DOI:10.1515/cti-2019-0020.
- Nedungadi S., Brown C. E. and Paek S. H., (2022), Differential Item Functioning Analysis of the Fundamental Concepts for Organic Reaction Mechanisms Inventory, J. Chem. Educ., 99(8), 2834–2842 DOI:10.1021/acs.jchemed.2c00242.
- Nedungadi S., Mosher M. D., Paek S. H., Hyslop R. M. and Brown C. E., (2021), Development and psychometric analysis of an inventory of fundamental concepts for understanding organic reaction mechanisms, Chem. Teach. Int., 3(4), 377–390 DOI:10.1515/cti-2021-0009.
- Osterlind S. J. and Everson H. T., (2009), Differential Item Functioning: Quantitative Applications in the Social Sciences, 2nd edn, Sage: Thousand Oaks, vol. 161.
- Rasch G., (1980), Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press.
- R Core Team, (2023), R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/.
- Revelle W., (2024), psych: Procedures for Psychological, Psychometric, and Personality Research, Evanston, Illinois: Northwestern University, R package version 2.4.6, https://CRAN.R-project.org/package=psych.
- Rhemtulla M., van Bork R. and Borsboom D., (2020), Worse than measurement error: consequences of inappropriate latent variable measurement models, Psychol. Methods, 25(1), 30–45 DOI:10.1037/met0000220.
- Rizopoulos D., (2022), ltm: Latent trait models under IRT, R package version 1.2-0.
- Sands D., Parker M., Hedgeland H., Jordan S. and Galloway R., (2018), Using concept inventories to measure understanding, Higher Educ. Pedagogies, 3(1), 173–182 DOI:10.1080/23752696.2018.1433546.
- Stemler S. and Naples A., (2021), Rasch Measurement v. Item Response Theory: Knowing When to Cross the Line, Pract. Assess., 26, 2021.
- Swaminathan H. and Rogers H. J., (1990), Detecting differential item functioning using logistic regression procedures, J. Educ. Meas., 27(4), 361–370 DOI:10.1111/j.1745-3984.1990.tb00754.x.
- Tetschner G. C. and Nedungadi S., (2023), Obtaining Validity Evidence During the Design and Development of a Resonance Concept Inventory, J. Chem. Educ., 100(1), 2795–2805 DOI:10.1021/acs.jchemed.3c00335.
- The jamovi project, (2024), jamovi (Version 2.5) [Computer Software], Retrieved from https://www.jamovi.org.
- Villafañe S. M., Loertscher J., Minderhout V. and Lewis J. E., (2011), Uncovering students’ incorrect ideas about foundational concepts for biochemistry, Chem. Educ. Res. Pract., 12(2), 210–218 DOI:10.1039/C1RP90026A.
- Walker C. M., (2011), What's the DIF? Why differential item functioning analyses are an important part of instrument development and validation, J. Psych. Assess., 29(4), 364–376.
- Weisstein N., (1968), Kinder, kuche, kirche as scientific law: Psychology constructs the female, Boston, MA: New England Free Press.
- Widarti H. R., Retnosari R. and Marfu’ah S., (2017), Misconception of pre-service chemistry teachers about the concept of resonances in organic chemistry course, AIP Conf. Proc., 1868(1, 4th International Conference on Research, Implementation, and Education of Mathematics and Sciences, 2017), 030014/1–030014/10 DOI:10.1063/1.4995113.
- Wren D. and Barbera J., (2013), Gathering Evidence for Validity during the Design, Development, and Qualitative Evaluation of Thermochemistry Concept Inventory Items, J. Chem. Educ., 90(12), 1590–1601 DOI:10.1021/ed400384g.
- Wren D. and Barbera J., (2014), Psychometric analysis of the thermochemistry concept inventory, Chem. Educ. Res. Pract., 15(3), 380–390 DOI:10.1039/C3RP00170A.
- Xue D. and Stains M., (2020), Exploring Students’ Understanding of Resonance and Its Relationship to Instruction, J. Chem. Educ., 97(4), 894–902 DOI:10.1021/acs.jchemed.0c00066.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4rp00170b
‡ The RCI will be available from the corresponding author upon request.
This journal is © The Royal Society of Chemistry 2025