Katherine Lazenby ac, Kristin Tenney b, Tina A. Marcroft b and Regis Komperda *bc
aNWEA, 121 NW Everett St., Portland, OR 97209, USA
bCenter for Research in Mathematics and Science Education, San Diego State University, USA
cDepartment of Chemistry & Biochemistry, San Diego State University, USA. E-mail: rkomperda@sdsu.edu
First published on 3rd March 2023
Assessment instruments that generate quantitative data on attributes (cognitive, affective, behavioral, etc.) of participants are commonly used in the chemistry education community to draw conclusions in research studies or inform practice. Recently, articles and editorials have stressed the importance of providing evidence for the validity and reliability of data collected with these instruments following guidance from the Standards for Educational and Psychological Testing. This study examines how quantitative instruments have been used in the journal Chemistry Education Research and Practice (CERP) from 2010–2021. Of the 369 unique researcher-developed instruments used during this time frame, the majority only appeared in a single publication (89.7%) and were rarely reused. Cognitive topics were the most common target of the instruments (56.6%). Validity and/or reliability evidence was provided in 64.4% of instances where instruments were used in CERP publications. The most frequently reported evidence was single administration reliability (e.g., coefficient alpha), appearing in 47.9% of instances. Only 37.2% of instances reported evidence of both validity and reliability. These results indicate that, as a field, opportunities exist to increase the amount of validity and reliability evidence available for data collected with instruments and that reusing instruments may be one method of increasing this type of data quality evidence for instruments used by the chemistry education community.
In CER, assessment instruments can be used to generate data related to cognitive knowledge of chemistry topics (e.g., concept inventories, American Chemical Society exams, instructor-developed exams, or other measures of chemistry content knowledge), affect (e.g., measures of student attitudes, beliefs, etc.), and behavior (e.g., observational protocols of student [or instructor] actions). Broadly defined, assessment instruments are tools used in social science research to quantitatively measure psycho-social attributes of research participants. While the psycho-social attributes that education researchers often seek to measure, such as knowledge and beliefs, are not directly observable, assessment instruments (henceforth referred to as instruments) allow researchers to collect data related to non-directly-observable attributes; these attributes are described as latent traits or constructs (American Educational Research Association et al., 2014; Wu et al., 2016). To support the use of quantitative data generated through the use of measurement instruments designed to measure latent traits, authors should provide “evidence that supports the interpretation and use of the data” for its intended purposes (Lewis, 2022, p. 1).
Because of the important role of instruments in generating research data and supporting the CER community's goals for improving educational experiences for students, it is important to understand how the research community uses and evaluates the quality of instruments and instrument-generated data (Lewis, 2022; Stains, 2022). In CER, both the international and United States flagship journals (this journal and the Journal of Chemical Education, respectively) require studies using instruments to provide evidence of the validity and reliability of instrument-generated data (Towns, 2013; Seery et al., 2019). Additionally, a number of chemistry education scholars have advocated for field-wide adoption of best practices in measurement (Barbera and VandenPlas, 2011; Arjoon et al., 2013; Komperda et al., 2018; Taber, 2018; Barbera et al., 2020; Rocabado et al., 2020). While field- and journal-specific recommendations provide guidance for researchers wishing to provide validity evidence for instrument-generated data, there is no single correct approach to doing so. Reviews in CER have highlighted the variation in researchers’ approach to gathering validity evidence and presenting it in published work (Arjoon et al., 2013; Deng et al., 2021). This research adds to that body of literature by examining trends in instrument development, use, and evaluation in Chemistry Education Research and Practice over a twelve-year period (2010–2021).
Before describing the details of the current study, it is necessary to define the terminology used to describe the quality of data collected from instruments, typically categorized as validity or reliability evidence. Additionally, this section addresses item difficulty and discrimination values, which are not considered by the Standards to be sources of validity or reliability evidence but can provide important psychometric information for instrument developers and users. Our conceptual framework uses definitions and operationalizations from the Standards and Arjoon et al. (2013), including five types of validity evidence (evidence based on test content, response processes of respondents, internal structure, relations to other variables, and consequences of testing), three categories of reliability evidence (test-retest coefficients, single-administration coefficients, and other less frequently used estimates), and the role of item difficulty/discrimination in CER.
Evidence of measurement invariance refers to evidence that item interrelationships do not differ across subgroups (e.g., racial, ethnic, or gender groups, or experimental conditions). When items function differently across subgroups, this is referred to as differential item functioning (DIF). DIF usually represents the measurement of an unintended dimension and is a threat to the validity of interpretations from instrument-generated data, though in some cases DIF is anticipated (American Educational Research Association et al., 2014; Rocabado et al., 2020). In our analysis, we code evidence of measurement invariance/DIF separately from other types of evidence based on internal structure (e.g., factor analyses).
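As a point of reference for readers, one common way to examine measurement invariance is through multi-group confirmatory factor analysis. The sketch below is a minimal, illustrative example in R using the lavaan package and its bundled HolzingerSwineford1939 data, with school as the grouping variable; in an actual study, the factor model and grouping variable would come from the instrument and the subgroups of interest.

```r
# Minimal multi-group CFA sketch with lavaan; the model and grouping
# variable here are illustrative (lavaan's built-in example data), not
# drawn from any instrument discussed in this study.
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Configural model: same factor structure, parameters free across groups
fit_config <- cfa(model, data = HolzingerSwineford1939, group = "school")

# Metric invariance: factor loadings constrained equal across groups
fit_metric <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = "loadings")

# Scalar invariance: loadings and item intercepts constrained equal
fit_scalar <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))

# Likelihood ratio tests of the nested models; small changes in fit
# support invariance at each successive level
anova(fit_config, fit_metric, fit_scalar)
```

Comparing configural, metric, and scalar models in this way is only one approach; invariance and DIF can also be examined within item response theory or Rasch frameworks, and the appropriate method depends on the instrument and the available sample.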
Chemistry Education Research and Practice (CERP) is a peer-reviewed international academic journal published by the Royal Society of Chemistry that publishes articles which “inform readers about some aspect of teaching and learning chemistry” (Seery et al., 2019, p. 335). CERP was selected for this study because it is freely accessible; all articles published in CERP are expected to include the typical components of research articles (description of methods, results, connections to prior literature) as well as implications for the practice of teaching chemistry; and CERP publishes a “diverse range of contributions from all over the world” (p. 338). It is therefore expected to broadly represent research and research practices in chemistry education for the purposes of this study. In this study, we present an investigation of the adoption of best practices in measurement and evaluation, as described by the Standards, in a census of journal articles published in CERP from 2010 to 2021. The study was guided by the following research questions:
(1) How are measurement instruments used and/or evaluated in studies reported in CERP?
(2) To what extent do CER researchers provide psychometric evidence for instrument data, as reported in CERP?
Articles considered for inclusion in the study must meet at least one of the following criteria and may meet multiple criteria:
1. The article reports the development of a novel instrument. [Original]
2. The article reports the modification of an existing instrument. [Modification]
3. The article reports the use of a novel or existing instrument; the data generated by the instrument is used to address research question(s). [Use]
4. The article reports evidence related to the quality of data generated using a novel or existing instrument. [Evaluation]
Additionally, instruments must generate quantitative data or qualitative data that may be scored quantitatively (e.g., assigned a numeric value according to a coding structure described by the article's authors). Articles that do not meaningfully use assessment instruments and articles using instruments that generate only qualitative data were not included in the study. Studies using qualitative data only were excluded for the purposes of limiting the scope of this study and because in the qualitative tradition, the “researcher may be considered the instrument” whereas instruments that generate quantitative data may be considered separately from the researchers conducting the study (Arjoon et al., 2013, p. 536).
We recorded articles which used standardized exams as data collection instruments (e.g., ACT, SAT, American Chemical Society [ACS] Exams), but we do not include these instances in our analysis for this study as the generation of psychometric evidence for those instruments is generally conducted and reported by the publisher, not individual researchers. We have included a parallel analysis that includes these instruments in the Appendices (Tables 5–8 and Fig. 4 and 5).
For each article meeting the inclusion criteria, variables were recorded at three levels:
1. Publication-level variables (Npublications = 292)
2. Instrument-level variables (Ninstruments = 369)
3. Publication-instrument-level variables (Npublication-instrument = 430)
Variables were recorded separately at three levels because some variables of interest apply to only publications (e.g., title of the article) or only instruments (e.g., name of the instrument); publication-instrument-level variables were used to record the relational aspects of publications and instruments (e.g., evidence of validity and reliability of instrument-generated data in a specific study).
We developed a coding protocol based on the Standards’ descriptions and definitions of (1) the process of instrument development, evaluation, and use and (2) validity and reliability evidence (American Educational Research Association et al., 2014); because these definitions have already been operationalized for use in CER, our coding protocol was also adopted from Arjoon et al. (2013). The definitions and operationalization of terms and concepts used in this study were largely unchanged between the 1999 and 2014 versions of the Standards, and therefore can be fairly applied to the analysis of publications within the study's focal timespan (2010–2021). The codebook definitions follow the conceptual framework described previously in this manuscript and the complete codebook is available in the OSF pre-registration materials.
The results presented here include analysis of data primarily at the publication-instrument level because we conceptualize publication-instrument instances as a proxy for the measurement targets of CER as a field as well as an indication of the extent to which best practices for reporting data quality evidence with each instrument administration are being followed.
Research Question 1: How are measurement instruments used and/or evaluated in studies reported in CERP?
RQ 1.1 Which researcher-developed instruments (if any) are commonly used in studies reported in CERP?
To gain a general sense of how instruments are used in CER, we investigated which instruments were administered in multiple studies. Most researcher-developed instruments which appeared in multiple publications (n ≥ 3) were designed to measure (1) affective constructs such as attitude, self-efficacy, interest, and motivation, and (2) students’ process skills such as information processing, scientific or logical reasoning, and visio-spatial thinking (Table 1) (Tobin and Capie, 1984; Geban et al., 1992; Dalgety et al., 2003; Bauer, 2008; Stamovlasis, 2010; Xu and Lewis, 2011; Cooper et al., 2012; Ferrell and Barbera, 2015; Ardura and Pérez-Bitrián, 2018; Galloway and Bretz, 2015).
Table 1 Researcher-developed instruments appearing in three or more publications
Researcher-developed instrument | Number of uses | Topic(s) |
---|---|---|
Attitude toward the subject of chemistry v2 | 8 | Affective – attitudes |
Attitude toward the subject of chemistry | 4 | Affective – attitudes |
College chemistry self-efficacy scale – cognitive skills scale | 4 | Affective – self-efficacy |
Lawson's classroom test of formal reasoning (Greek) | 4 | Process skills – scientific reasoning |
Chemistry attitudes and experiences questionnaire | 3 | Affective – attitudes, self-efficacy |
Initial and maintained interest in chemistry | 3 | Affective – interest |
Spanish translation of the science motivation questionnaire II | 3 | Affective – motivation |
Meaningful learning in the laboratory instrument | 3 | Affective – attitudes; Cognitive – self-assessed laboratory skills |
Scientific process skills test (Turkish) | 3 | Process skills |
Implicit information from Lewis structures | 3 | Process skills – information processing |
Test of logical thinking | 3 | Process skills – logical thinking |
Group embedded figures test (Greek) | 3 | Visio-spatial thinking |
Reformed teaching observational protocol | 3 | Teaching strategies/behaviors |
Students’ understanding of models in science | 3 | Nature of science (models) |
Students’ assessment of learning gains | 3 | Self-assessed learning gains |
Other instruments not targeting affective constructs or process skills which appeared in three publications include the Reformed Teaching Observational Protocol (RTOP), the Students’ Understanding of Models in Science (SUMS), and the Students’ Assessment of Learning Gains (SALG) (Piburn et al., 2000; Seymour et al., 2000; Treagust et al., 2002). All other instruments appeared in only one (n = 331) or two (n = 23) publications; most commonly, instruments observed twice appeared in two publications by the same author(s) (n = 17). The observed count of all publication-instrument instances can be found in the provided data and R code.
We observed that of the instruments that appeared in three or more publications, only the Implicit Information from Lewis Structures Instrument (Cooper et al., 2012) assessed chemistry-specific cognitive knowledge/skills, while the Meaningful Learning in the Laboratory Instrument (Galloway and Bretz, 2015) bridges the affective and self-reported laboratory skills domains. We found this observation curious, given that the data for this study are publications in a chemistry education-specific journal, though it is potentially reflective of practice. It may be that those publishing articles in CER feel more competent developing instruments to measure chemistry knowledge (cognitive) than instruments that measure constructs (e.g., affective constructs) that have been more extensively studied in adjacent fields (e.g., educational psychology); therefore, researchers are more likely to reuse instruments from adjacent fields, while they might develop novel instruments for cognitive measurement goals. Additionally, researchers might use ACS exams (or other instruments developed by testing organizations) for cognitive measurement targets; while these instruments were excluded from the remainder of our analyses, we do note that standardized exams were used as instruments in studies published in CERP with some frequency. ACS exams (any version) were administered in eleven studies, and student scores on the SAT from any year were used in nine studies. No other excluded instrument was used more than once. We have included results from parallel analyses, which include these excluded instruments, in Appendix 1 (Table 5).
RQ 1.2 Which topics are commonly measured for studies reported in CERP?
In our analysis related to RQ 1.1, we observed that the most commonly administered researcher-developed instruments are those designed to measure affective constructs and process skills, and only one (IILSI) of the most common instruments is designed to measure chemistry-specific knowledge. Based on this observation, we investigated whether/to what extent the measurement targets in CERP studies also prioritized measurement of affective constructs and process skills.
In this analysis, we observed a mismatch between the measurement targets overall (Table 2) and those for the most commonly reused researcher-developed instruments (Table 1). The target construct for more than half of all publication-instrument instances (Npublication-instrument = 430) was coded as cognitive (54.9%, n = 236), while affective measurement targets represented only 34.2% of all publication-instrument instances (n = 147). Publication-instrument instances coded as Behavioral (9.8%, n = 42), Metacognition (3.3%, n = 14), Evaluation (3.0%, n = 13), and Nature of Science (0.5%, n = 2) represented small proportions of measurement targets. Less than five percent of instruments were designed to measure targets not represented in our coding scheme. The proportions of unique instruments designed to measure each target topic domain are very similar (Table 3). See the OSF materials for additional details on the coding scheme.
Table 2 Measurement topics of publication-instrument instances (N = 430)
Topic | Percent of publication-instrument instances (N = 430) |
---|---|
Cognitive | 54.9% (n = 236) |
Affective | 34.2% (n = 147) |
Behavioral | 9.8% (n = 42) |
Metacognition | 3.3% (n = 14) |
Evaluation | 3.0% (n = 13) |
Nature of science | 0.5% (n = 2) |
Other | 4.9% (n = 21) |
Table 3 Measurement topics of unique instruments (N = 369)
Topic | Percent of unique instruments (N = 369) |
---|---|
Cognitive | 56.6% (n = 209) |
Affective | 31.2% (n = 115) |
Behavioral | 9.8% (n = 36) |
Metacognition | 3.5% (n = 13) |
Evaluation | 3.0% (n = 11) |
Nature of science | 0.5% (n = 2) |
Other | 4.9% (n = 18) |
Though CER has historically focused on the cognitive domain, we were interested in investigating whether measurement targets have diversified over the last decade to include more work in the affective domain and other domains, such as behavior and metacognition. Therefore, we disaggregated the previous analysis by year of publication to examine any trends in measurement targets by topic/domain (Fig. 1). We observed moderate diversification of the topics of measurement instruments, including a small increase in the count and proportion of publication-instrument instances measuring topics in the affective domain. Generally, however, there were no clear trends in measurement targets over time.
Fig. 1 Researcher-developed instruments in publication-instrument instances (N = 430) by topic by year as count (top) and proportion (bottom).
RQ 1.3 What is the ratio of CERP studies using instruments previously developed by other researchers relative to the development of novel instruments for their research purposes?
Data addressing RQ 1.1 suggest that researcher-developed instruments are often developed or modified for the purposes of a specific study, rarely used again, and very infrequently administered by researchers other than the original authors. To further support this claim, we coded each publication-instrument instance for the nature of the instrument's appearance in the publication.
Codes and their definitions correspond to the inclusion criteria listed in the Methods section. Because meaningful use of an assessment instrument, as operationalized by this coding structure, was a criterion for inclusion in the study, every publication-instrument instance was assigned at least one code; the codes are not mutually exclusive, and instances could receive multiple codes. Fig. 2 represents the extent of the overlap between the codes. As seen in the top of Fig. 2, almost all publication-instrument instances were coded as Use. The code for Modification is not specifically visualized because it is always a subset completely contained within Original: every Modification of an existing instrument also represents the Original publication of the new, modified instrument.
Of the 430 publication-instrument instances, 70.7% were coded as an Original instrument (n = 304; Fig. 2 – Section A), indicating that researchers developed a novel instrument for most quantitative measurement goals in CERP studies. Of these Original instruments, 41.4% (n = 126) were modified from existing instruments for the purposes of the study (not shown in Fig. 2). Of the 292 publications included in the study, 80.5% included at least one Original instrument. Together, these data may indicate that researchers are interested in studying and measuring constructs for which appropriate instruments do not already exist and/or that researchers are unaware that potentially useful instruments already exist, so they develop new instruments for their research purposes.
We also investigated the co-occurrence of codes for the nature of the publication-instrument instances (Npublication-instrument = 430). For 64.4% (n = 277) of publication-instrument instances, authors provided some evidence related to the validity and/or reliability of instrument-generated data, shown in the top of Fig. 2 as Evaluation. Authors were more likely to provide evidence of data validity or reliability for Original instruments (71.1%; n = 216; Fig. 2 – Section C) than pre-existing instruments (48.4%; n = 61; Fig. 2 – Section D). Nearly all observed publication-instrument instances involved the use of instrument-generated data to address research question(s) (use; n = 400; 93.0%; Fig. 2 – Section G); this is unsurprising, as a primary purpose of instruments in research is the measurement of constructs through the generation of data. Of the studies that used instrument-generated data to address research questions, 37.8% (n = 151; Fig. 2 – Sections E and F) provided no evidence of data validity or reliability; these instances where instruments were used to generate data without evaluation included both Original (n = 86; Fig. 2 – Section E) and pre-existing (n = 65; Fig. 2 – Section F) instruments.
Publication-instrument instances which were not coded as Use (Fig. 2 – Section H) include the studies which investigated the psychometric properties of newly developed (n = 24) and pre-existing (n = 6) instruments but did not use instrument-generated data to address other research questions. While field- and journal-specific recommendations (Towns, 2013; Seery et al., 2019) and documents like the Standards (American Educational Research Association et al., 2014) provide some guidance for researchers on how to collect and report validity and reliability evidence for instrument-generated data, there is no single approach to doing so. In the above analysis, the publication-instrument instances that were coded as Evaluation varied considerably in the types and amount of evidence presented. We investigated this variation in the following analyses.
Research Question 2: To what extent do CER researchers provide psychometric evidence for instrument data, as reported in CERP?
RQ 2.1 To what extent is data quality evidence reported in CERP studies? When data quality evidence is reported, what kind of evidence is typically reported?
In our analysis for RQ 1.3, we observed that approximately two-thirds of all publication-instrument instances (Npublication-instrument = 430) reported some evaluation of data quality evidence. There is no standard approach to collecting and reporting validity and reliability evidence, and for this analysis, we recorded the types of validity and reliability evidence reported in CERP studies. Our coding structure was developed based on the Standards, described in the conceptual framework; code definitions are also included in the OSF materials.
Authors most commonly reported validity evidence based on internal structure (24.9%), test content (24.4%), and relations with other variables (23.0%). Validity evidence based on response processes (11.6%) was less commonly reported. While evidence based on internal structure (e.g., factor analysis, principal component analysis, Rasch analysis) was reported in nearly a quarter of all instances, authors rarely reported evidence of measurement invariance or DIF (3.0%). We observed no instances of validity evidence based on consequences of testing, so this category does not appear in our analysis. Nearly half of all publication-instrument instances (roughly three-quarters of those with any evaluation) reported a single administration reliability coefficient (e.g., coefficient alpha, McDonald's omega). Test-retest reliability was reported much less frequently (Table 4). G-theory approaches to reliability estimation were not observed and are therefore not reflected in this analysis.
Table 4 Types of psychometric evidence reported in publication-instrument instances (N = 430)
Type of psychometric evidence | Percent of publication-instrument instances (N = 430) |
---|---|
Evaluation (any) | 64.4% (n = 277) |
Validity (n = 218) | |
Internal structure (factor analysis/principal components analysis/Rasch analysis) | 24.9% (n = 107) |
Test content | 24.4% (n = 105) |
Relations with other variables | 23.0% (n = 99) |
Response processes | 11.6% (n = 50) |
Measurement invariance/DIF | 3.0% (n = 13) |
Reliability (n = 209) | |
Single administration reliability | 47.9% (n = 206) |
Retest reliability | 1.4% (n = 6) |
The role of validity evidence is to justify the use of instruments for specified purposes. Because of the variety of instruments and circumstances in which they are used, it is “natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful” (American Educational Research Association et al., 2014, p. 12). However, multiple, complementary types of validity and reliability evidence support the proposition that conclusions from instrument-generated data are trustworthy. In this study, about half of instrument administrations were used without any reported evidence of validity (n = 212). Of the instances that did provide some validity evidence (n = 218), more than half reported only one source of validity evidence (n = 118); one hundred instances reported two or more complementary sources.
Similarly, we observed no reported evidence of reliability for more than half of the publication-instrument instances (n = 221). Because validity and reliability are complementary constructs, the Standards suggest that evidence of both is required to support the trustworthiness of conclusions based on instrument-generated data. In this study, just 37.2% (n = 160) of publication-instrument instances reported evidence of both validity and reliability. Some reported evidence of validity and no evidence of reliability (n = 58); others reported only reliability coefficients without evidence of validity (n = 49).
We also observed that difficulty and discrimination indices were reported in some studies as evidence of data quality. These indices are not considered evidence of either validity or reliability in our conceptual framework, but they are somewhat commonly reported in the CER literature. These values can help inform test developers and users about the extent to which items are functioning as intended with the population being measured by the instrument. The Standards describe item difficulty and discrimination values as part of the item screening process that contributes to overall instrument development. Difficulty indices (based on either classical test theory [CTT] or item response theory [IRT]/Rasch models) were reported in 7.9% of studies (n = 34); discrimination indices were reported in 6.7% of studies (n = 29).
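As a concrete illustration of these two CTT indices, the sketch below uses R with simulated dichotomous responses (all object names are hypothetical placeholders); it computes difficulty as the proportion of correct responses and discrimination as the corrected item-total (point-biserial) correlation, one common operationalization.

```r
# Illustrative CTT item statistics on simulated data:
# rows = students, columns = items, 1 = correct, 0 = incorrect
set.seed(2023)
scored <- matrix(rbinom(100 * 10, 1, 0.6), nrow = 100, ncol = 10)

# CTT difficulty: proportion of students answering each item correctly
difficulty <- colMeans(scored)

# CTT discrimination: corrected item-total (point-biserial) correlation,
# i.e., each item correlated with the total score excluding that item
discrimination <- sapply(seq_len(ncol(scored)), function(i) {
  cor(scored[, i], rowSums(scored[, -i]))
})

round(cbind(difficulty, discrimination), 2)
```

In practice these indices would be computed from actual scored responses, and how they are interpreted (e.g., flagging items with very low discrimination) depends on the purpose of the instrument.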
RQ 2.2 What trends exist in the data quality evidence reported?
In acknowledgement of the relative youth of CER as a discipline and a push by field leaders to improve the quality of research, including some calls for improved practice in instrument evaluation, we investigated the types of validity and reliability evidence for researcher-developed instruments presented over the 12 years of CERP included in this study (Fig. 3). We expected that instrument evaluation practices might improve with the maturity of CER as a discipline.
Fig. 3 Proportion of publication-instrument instances (N = 430) that reported types of validity evidence, either qualitative (top) or quantitative (middle), or reliability evidence (bottom) by year.
For validity evidence that is typically collected using qualitative methods (test content and response processes), no clear trend was observed over time. We did observe trends in reported quantitative validity evidence. There was an increase in the reported use of factor analysis (and similar methods) to investigate the internal structure of instruments; similarly, measurement invariance has been reported as a source of validity evidence more frequently since 2019. Of note, in 2019, CERP published an editorial intended “to provide guidance on submitting manuscripts” to the journal, which formally set the expectation that authors include evidence related to validity and reliability in studies using instruments to generate quantitative data (Seery et al., 2019, p. 355). We expect that practices will continue to improve to meet this standard, and we see some evidence that this is already the case (e.g., the observed uptick in reporting of measurement invariance/DIF).
The second, and more salient, observed trend is the parallel between the proportion of publication-instrument instances coded as Evaluation and those coded as Single Administration Reliability. Others have criticized researchers’ overreliance on single administration reliability coefficients, in particular coefficient alpha (Komperda et al., 2018; Taber, 2018; Barbera et al., 2020). An overwhelming majority of the publication-instrument instances (Npublication-instrument = 430) coded as Single Administration Reliability (n = 206) reported coefficient alpha (n = 164). While coefficient alpha can be used to estimate the reliability of data generated using instruments, we encourage readers to use alpha (and other reliability estimates) only when the underlying mathematical assumptions are met and not as a substitute for validity evidence. A plot of difficulty and discrimination indices by year can be found in Appendix 3 (Fig. 5).
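For readers less familiar with how this coefficient is obtained, the following is a minimal sketch in R of coefficient alpha computed from its textbook definition; the item responses are simulated and purely illustrative, the calculation assumes complete numeric item-level data, and dedicated packages (e.g., psych) report alpha along with additional diagnostics.

```r
# Minimal sketch: coefficient alpha from its definition,
# alpha = (k/(k-1)) * (1 - sum(item variances) / variance(total score)).
# Assumes complete numeric item-level data; object names are illustrative.
coefficient_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                  # number of items
  item_var <- apply(items, 2, var)  # variance of each item
  total_var <- var(rowSums(items))  # variance of the total score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Illustration with simulated Likert-type responses (200 respondents, 5 items);
# because these simulated items are uncorrelated, alpha will be near zero,
# whereas real scale data with interrelated items yield higher values.
set.seed(2023)
items <- matrix(sample(1:5, 200 * 5, replace = TRUE), ncol = 5)
coefficient_alpha(items)
```

The ease of producing this single number may partly explain its prevalence; as noted above, it speaks only to reliability and does not substitute for validity evidence.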
We recorded the types of validity and reliability evidence that authors reported to support their inferences based on instrument-generated data. Our analyses suggest an overreliance on single administration reliability estimates (specifically, coefficient alpha) as evidence of data quality and demonstrate that researchers rarely present multiple complementary sources of validity evidence. Our finding that roughly 50% of the publication-instrument instances report alpha is consistent with findings from Arjoon et al. (2013), where 75% of instruments examined reported alpha; in both cases, alpha was far more prevalent than test-retest reliability. For validity, our analysis found roughly equivalent use of relations with other variables, test content, and internal structure evidence in approximately 25% of the publication-instrument instances. This is noticeably different from the Arjoon et al. (2013) results, which found mostly relations with other variables (95%), followed by test content (55%) and internal structure (45%).
Though we noted the absence of validity evidence as it relates to consequences of testing, overall, practices in reporting evidence of data quality appear to be improving, perhaps in response to recommendations from field leaders in measurement (Arjoon et al., 2013) and explicit expectations from both CERP and the Journal of Chemical Education (Lewis, 2022; Seery et al., 2019; Stains, 2022; Towns, 2013). Based on our findings, we make the following recommendations:
Recommendation 1: Consider using instruments that have already been developed and published alongside evidence of validity and reliability
We recorded 369 unique researcher-developed instruments in our analysis of 292 publications over twelve years (2010–2021) in Chemistry Education Research and Practice. Because of the nature of research, it is inevitable that researchers will sometimes have measurement goals that require the development of new instruments. However, we encourage researchers to consider instruments that have already been developed and evaluated for their research purposes. The use of existing instruments, where appropriate, will contribute to the body of evidence for validity and reliability of instrument-generated data and will allow research resources to be allocated to endeavors other than the development of redundant instruments. Other studies that have investigated instrument use and evaluation practices have also recommended that researchers consider extant instruments and contribute to the body of evidence supporting instruments’ use across contexts (Blalock et al., 2008; Arjoon et al., 2013).
To support researchers (and practitioners) in choosing among the many extant instruments, this research team and our colleagues have developed an online resource, the Chemistry Instrument Review and Assessment Library (CHIRAL; Barbera et al., 2022). CHIRAL can be accessed at https://chiral.chemedx.org/ and has been populated with information about instruments, the publications that instruments appear in, and published validity and reliability evidence, including for all instruments identified in this study.
Recommendation 2: Collect and publish evidence of validity and reliability in all studies that base conclusions on instrument-generated data
In this study, we observed that researchers were more likely to present evidence of data validity and reliability when using novel instruments for data generation, compared to when using existing instruments. We encourage the field to always evaluate data for validity and reliability, which is aligned with the notion that evaluation of instruments is the responsibility of both instrument developers and users (American Educational Research Association et al., 2014; Lewis, 2022; Stains, 2022). We emphasize that some approaches for evaluating the quality of data are inappropriate or impossible in some cases; for example, it would be inappropriate to perform factor analyses with very small datasets. However, researchers should consider sources of validity evidence that are appropriate for their research contexts; for example, conducting response process interviews with students from the target population is both possible and appropriate for studies with small datasets. Collecting evidence of data quality should be considered from the outset of a study and included in the study design (Stains, 2022), and multiple resources exist to support researchers in collecting and making sense of data quality evidence (Arjoon et al., 2013; Komperda et al., 2018; Rocabado et al., 2020; Deng et al., 2021). A field-wide commitment to collection and publication of data quality evidence for all studies will support the trustworthiness and the impact of research in chemistry education.
Recommendation 3: Include information on the collection of data quality evidence in a detailed methods section
One limitation of this study is that our analyses and codes were constrained by the information that researchers opted to include in their published articles. During data collection, we found that researchers adopt a range of approaches for describing their efforts to evaluate instruments and instrument-generated data. Sometimes, our data collection was complicated by the scattering of validity and reliability evidence throughout multiple sections of an article or inclusion of this evidence only in the supplementary information, without mention in the body of the article. Sometimes, authors reference prior evaluation efforts ambiguously, and we found it difficult to distinguish between discussion of prior evaluation efforts and the authors’ own efforts. If authors are relying exclusively on prior evaluation efforts to support their case that data are valid and reliable (which we do not recommend), this should be clearly stated.
It is possible that some researchers opted not to include relevant data quality evidence in their published work due to space constraints or other reasons, and therefore we (and future instrument users) are unaware of these efforts. We encourage researchers to explicitly and intentionally include details on their approaches to evaluating instruments and instrument-generated data in their methods sections, and we encourage reviewers and editors to recommend this inclusion. If these details must be presented in the ESI, authors should direct the interested reader to those materials.
Additionally, the language around validity and reliability has changed over time and is often ambiguous. This complicated our efforts to code the type of data quality evidence presented in publications. Authors should consider adopting formal terms, for example from the Standards, as in this study and others (Arjoon et al., 2013; American Educational Research Association et al., 2014), which will support more universal understanding of methods and approaches to evaluation. Additionally, authors should include all relevant details about the target population(s) in their studies, including participant characteristics (age, course level) and context (language used in the classroom, country in which the study was conducted). The inclusion of such relevant details can support readers’ interpretations and evaluation of the relevance of research relative to other contexts (Stains, 2022).
Table 5 Instruments appearing in three or more publications, including standardized exams
Instrument | Number of uses | Topic(s) |
---|---|---|
American chemical society exams – any | 11 | Cognitive knowledge of chemistry topics (varies) |
SAT | 9 | Cognitive knowledge of multiple topics |
Attitude toward the subject of chemistry v2 | 8 | Affective – attitudes |
Attitude toward the subject of chemistry | 4 | Affective – attitudes |
College chemistry self-efficacy scale – cognitive skills scale | 4 | Affective – self-efficacy |
Lawson's classroom test of formal reasoning (Greek) | 4 | Process skills – scientific reasoning |
Chemistry attitudes and experiences questionnaire | 3 | Affective – attitudes, self-efficacy |
Initial and maintained interest in chemistry | 3 | Affective – interest |
Spanish translation of the science motivation questionnaire II | 3 | Affective – motivation |
Meaningful learning in the laboratory instrument | 3 | Affective – attitudes; Cognitive – self-assessed laboratory skills |
Scientific process skills test (Turkish) | 3 | Process skills |
Implicit information from Lewis structures | 3 | Process skills – information processing |
Test of logical thinking | 3 | Process skills – logical thinking |
Group embedded figures test (Greek) | 3 | Visio-spatial thinking |
Reformed teaching observational protocol | 3 | Teaching strategies/behaviors |
Students’ understanding of models in science | 3 | Nature of science (models) |
Students’ assessment of learning gains | 3 | Self-assessed learning gains |
Table 6 Measurement topics of publication-instrument instances, including standardized exams (N = 460)
Topic | Percent of publication-instrument instances (N = 460) |
---|---|
Cognitive | 56.7% (n = 261) |
Affective | 32.0% (n = 147) |
Behavioral | 9.1% (n = 42) |
Metacognition | 3.0% (n = 14) |
Evaluation | 2.8% (n = 13) |
Nature of science | 0.4% (n = 2) |
Other | 4.5% (n = 21) |
Table 7 Measurement topics of unique instruments, including standardized exams (N = 377)
Topic | Percent of unique instruments (N = 377) |
---|---|
Cognitive | 57.3% (n = 216) |
Affective | 30.5% (n = 115) |
Behavioral | 9.5% (n = 36) |
Metacognition | 3.4% (n = 13) |
Evaluation | 2.9% (n = 11) |
Nature of science | 0.5% (n = 2) |
Other | 4.8% (n = 18) |
Table 8 Types of psychometric evidence reported in publication-instrument instances, including standardized exams (N = 460)
Type of psychometric evidence | Percent of publication-instrument instances (N = 460) |
---|---|
Evaluation (any) | 60.8% (n = 280) |
Validity (n = 220) | |
Internal structure (factor analysis/principal components analysis/Rasch analysis) | 23.3% (n = 107) |
Test content | 22.8% (n = 105) |
Relations with other variables | 21.9% (n = 101) |
Response processes | 10.9% (n = 50) |
Measurement invariance/DIF | 2.8% (n = 13) |
Reliability (n = 211) | |
Single administration reliability | 45.2% (n = 208) |
Retest reliability | 1.3% (n = 6) |
Fig. 4 Researcher-developed instruments in publication-instrument instances (N = 460) including standardized exams by topic by year as count (top) and proportion (bottom).