Considerations on validity for studies using quantitative data in chemistry education research and practice
Abstract
An explicit account of validity considerations within a published paper allows readers to evaluate the evidence that supports the interpretation and use of the data collected within a project. This editorial is meant to provide considerations on how validity has been presented and reviewed among papers submitted to Chemistry Education Research and Practice (CERP) that analyze quantitative data. Authors submitting to CERP are encouraged to make an explicit case for validity, and this editorial describes the varying sources of evidence that can be used to organize that case.
What is validity and validation?
As chemistry education research (CER) has grown in rigor over the past thirty years, so too has attention to validity in CER studies. By making an explicit account of validity considerations within a published paper, authors can make a clear case for the evidence that supports the interpretation and use of the data collected within a project. This editorial is meant to provide some thoughts on how validity has been presented and reviewed among papers submitted to Chemistry Education Research and Practice (CERP) that analyze quantitative data. Readers interested in a more comprehensive treatment of validity are encouraged to consult a more authoritative text. One such recommended text is the Standards for Educational and Psychological Testing (Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 2011), which was co-developed by three professional organizations of researchers who conduct studies with data from human subjects. Additionally, Arjoon et al. (2013) published a review of CER instruments using the Standards framework.
In education research, data is often collected with an instrument, which can be a test, survey, or rubric. Each instrument is composed of a series of questions or prompts, which will be referred to as items. Validity describes the extent to which available evidence supports the use and interpretations of the data collected. Validation is the process of collecting and compiling this evidence. Validity is a judgment decision, as authors and readers must judge whether the evidence presented is sufficient and appropriate to support the uses and interpretations of the data resulting from an instrument. Therefore, it is helpful for authors to make an explicit case for validity to aid readers in making this assessment.
A key feature of this definition is that validity is evidence used to support the interpretation of the data collected from an instrument with a specific population. Occasionally papers sent to CERP describe the validity of an instrument or note that “the instrument has been validated”. Reviewers or editors commonly raise a concern about this description, as it implies that the instrument will always generate valid data. The data collected by an instrument depends on how the instrument was administered, to whom it was administered, and how the data will be used. As a result, validity evidence is specific to the data collected, which includes a consideration of instrument administration and participant background, and the intended use of the data.
Sources of evidence collected
The sources of evidence that can contribute to making a case for validity are varied, and because validity is a judgment decision there is no prescriptive set of evidence to compile. Authors, reviewers, and readers can consider the intended use of the data to inform their judgment; for example, a paper that seeks to compare groups may look for evidence that the instrument functions consistently across these groups, while another paper may not conduct such comparisons, making such evidence less relevant. Table 1 summarizes some common sources of evidence that are explained in more detail below. Historically, labels were given to types of validity (e.g., content validity, face validity); most texts now describe validity as a singular construct that can be evaluated based on differing sources of evidence, and authors are encouraged to present data as sources of evidence for validity.
Table 1 Common sources of evidence for validity
Source of evidence | Summary
Test content | Evidence that demonstrates an instrument measures the content it is intended to measure
Internal structure | Evidence that characterizes the relation among items designed to measure a construct
Response process | Evidence that respondents have interpreted the items and/or response scale in the way that was intended by the researcher
Relationship to other variables | Evidence of a theoretically supported relationship between the data collected and other external metrics
Consequence of testing | Evidence that the interpretations of the data do not introduce bias among groups of participants
Test content
A particular concern for the content of an instrument is the presence of content underrepresentation (Joint Committee, 2011). Content underrepresentation occurs when the content an instrument is intended to measure is either absent or underrepresented among the items. For example, if an instrument to measure students’ understanding of the structure of the atom is used to describe students’ understanding of isotopes, but the instrument contains only a single item specific to isotopes, the lack of sufficient items pertaining to isotopes would represent content underrepresentation. Most often in CERP, evidence for test content has been communicated in papers that describe an instrument's development. Processes for establishing test content include building an instrument from student responses (Cooper et al., 2012), having an expert panel review an instrument (He et al., 2021), or mapping an instrument onto a set of learning objectives (Lewis et al., 2011).
The latter two examples represent what has traditionally been called “face validity”, that is, the idea that whether an instrument works can be evaluated based on a visual inspection of the items. As this type of examination does not consider data collected from the instrument, it is unlikely to make a compelling case for validity on its own but may complement other sources of validity evidence.
Internal structure
Most instruments include multiple distinct items designed to measure the same construct, as information compiled across multiple items provides greater insight than a single item. Evidence to support the internal structure details the extent to which the item-level data collected matches the intended structure of the instrument. Instruments may be designed to measure a single construct or multiple constructs, and in the latter case the constructs may be presumed to be related or unrelated.
Researchers commonly conduct factor analyses to investigate the evidence for internal structure of an instrument. For more established instruments, a confirmatory factor analysis will indicate the goodness of fit of the data collected to the intended structure of the instrument (e.g., which items belong to which constructs) (see Komperda et al., 2020 for an example). When comparing groups on the instrument, measurement invariance testing can be used to determine whether the internal structure is consistent across student groups. A recent primer published in CERP by Rocabado and colleagues (2020) describes this technique further. For the development of new instruments, or when applying an existing instrument in a substantially new setting, exploratory factor analyses may be reported to determine the item groupings that arise from the data. See, for example, the work by Schönborn and colleagues (2015). Costello and Osborne (2005) provide a guide to conducting exploratory factor analysis. Because exploratory factor analysis may overfit the data and does not provide goodness-of-fit indices, a recommended technique is to randomly assign half of the dataset for an exploratory factor analysis and then use the other half of the dataset for a confirmatory factor analysis (see Ardura and Perez-Bitrián, 2018; Montes et al., 2022 for examples).
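To make the split-half workflow concrete, the sketch below pairs an exploratory factor analysis on one random half of a dataset with a confirmatory factor analysis on the other half. It is a minimal illustration only: the file name, item names, two-factor model description, and the choice of the factor_analyzer and semopy packages are assumptions for demonstration, not the procedures used in the studies cited above.

```python
# Minimal sketch of the split-half EFA/CFA workflow (illustrative only).
import pandas as pd
import semopy
from factor_analyzer import FactorAnalyzer

# Hypothetical file of item responses: one row per respondent, one column per item
responses = pd.read_csv("item_responses.csv")  # columns assumed: item_1 ... item_6

# Randomly assign half of the respondents to each analysis
half_efa = responses.sample(frac=0.5, random_state=1)
half_cfa = responses.drop(half_efa.index)

# Exploratory factor analysis on the first half to see how items group together
efa = FactorAnalyzer(n_factors=2, rotation="oblimin")
efa.fit(half_efa)
print(pd.DataFrame(efa.loadings_, index=responses.columns))

# Confirmatory factor analysis on the second half, testing the structure
# suggested by the EFA (model string uses the hypothetical item names)
model_desc = """
construct1 =~ item_1 + item_2 + item_3
construct2 =~ item_4 + item_5 + item_6
"""
cfa = semopy.Model(model_desc)
cfa.fit(half_cfa)
print(semopy.calc_stats(cfa)[["CFI", "TLI", "RMSEA"]])  # goodness-of-fit indices
```

The fit indices from the confirmatory step can then be compared against the thresholds commonly reported in the factor analysis literature when judging whether the proposed structure is supported.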
Response process
An important consideration is the extent to which respondents interpret the items, and any associated response scale, within an instrument in a manner consistent with researchers’ expectations. Consider a test item that asks respondents to classify “A match burning” as a chemical or physical process. Students who are only familiar with lighters, and unfamiliar with a “match” as a small stick with a combustible end, may not interpret this item as intended. Evidence for response process is meant to address concerns over how participants interpret the items. Often this evidence is collected through cognitive interviews, where a participant is given each item and asked to discuss their interpretation of the item or their process for solving it, as well as how they use any associated response scale. Ideally, the participants engaging in the cognitive interviews have a high degree of similarity to the participants who will ultimately receive the instrument (i.e., the target population); cognitive interviews with experts may not inform how novices would interpret the items. Recently, Deng and colleagues (2021) presented a review of response process evidence in CER studies, including recommendations for how to collect such evidence.
CERP receives and publishes manuscripts reporting studies conducted in international contexts. A challenge for response process evidence arises when an instrument needs to be translated to be accessible to participants in the local context of the study. There are many sides to the translation process in these studies, such as the translation of the instrument, the collection of data in the local language, and, if publication in an international journal like CERP is sought, reporting the data and its interpretations in English. Taber (2018a) describes these challenges further and includes potential strategies to navigate them. One strategy is back-translation, which involves re-translating items back into the original language and comparing the back-translated text with the original, a task carried out by experts fluent in both languages. An example study that utilized this strategy is Özalp and Kahveci's (2015) work on diagnostic assessment, in which a group of bilingual chemistry teachers served as the experts in the back-translation process, after which a group of students was consulted for further evaluation of the item interpretations. Prospective CERP authors reporting international research are encouraged to be explicit about whether the data were collected in another language with a translated instrument, and about the processes enacted to demonstrate the quality and fidelity of the translation.
Relations to other variables
Most instruments measure a construct that would have theoretically expected relationships with other external variables. For example, students’ performance on a proposed measure of acid–base understanding might be expected to show strong agreement with the same students’ performance on another measure of acid–base understanding. Expected relationships with external variables can take the form of agreement, disagreement, or independence. The comparison does not have to be between measurements from two separate instruments. There are examples within CERP where researchers investigating cognitive skills or knowledge compared instrument data to participants’ relative experience, relying on the expected relationship that participants with more relevant experience should score higher (Cooper et al., 2012; Danczak et al., 2020; He et al., 2021). Another example related data from a measure of students’ science motivation to their persistence in enrolling in science courses (Ardura and Perez-Bitrián, 2018). Evidence relating measurements to other variables can serve both to make a case that the measurement is functioning as intended and to begin to outline how the measurement may be useful in understanding other phenomena.
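As a simple illustration of this type of evidence, the sketch below computes a Pearson correlation between scores on a hypothetical proposed measure and an established external measure completed by the same students. The simulated data and variable names are assumptions for demonstration only.

```python
# Minimal sketch of checking an expected relationship with an external variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated scores for the same 120 students on two measures (illustrative data)
established_scores = rng.normal(loc=50, scale=10, size=120)
proposed_scores = established_scores + rng.normal(scale=8, size=120)

r, p = stats.pearsonr(proposed_scores, established_scores)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
# A strong positive correlation is consistent with the theoretically expected
# agreement; a near-zero or negative value would prompt a closer look at how
# the proposed measure is functioning.
```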
Consequence of testing
For instruments that may be used in higher-stakes decisions such as admissions to a course or program, it is important to ascertain the extent to which the data is related to the outcome of the decision. Consider a case where scores on an instrument are used to determine who can take a course. Evidence should be collected that the scores from the instrument are related to the likelihood of success in the course. Sometimes the relationship with the decision can be indirect. For example, a test can be used to determine part of a course grade and then the course grade is used to determine whether a student meets a pre-requisite requirement to take the follow-on course. In such instances, relating the test to academic performance in the subsequent course can be used to determine the appropriateness of either the test for determining the earlier course grades or the prerequisite relationship between courses (see Lewis, 2014 for an example). For instruments used in higher-stakes decisions, researchers may pay particular attention to the consistency of the instrument across student groups (e.g., through measurement invariance testing) as inconsistencies across groups can lead to inequitable student outcomes.
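One way to examine such a relationship, sketched below, is a logistic regression relating instrument scores to a binary indicator of course success. The simulated scores, the pass/fail outcome, and the use of statsmodels are illustrative assumptions rather than a prescribed analysis.

```python
# Minimal sketch of relating placement-style scores to course success (illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
scores = rng.normal(loc=60, scale=15, size=300)      # simulated instrument scores
prob_pass = 1 / (1 + np.exp(-(scores - 60) / 10))    # assumed underlying relationship
passed = rng.binomial(1, prob_pass)                  # 1 = succeeded in the course

model = sm.Logit(passed, sm.add_constant(scores)).fit(disp=False)
print(model.summary())  # a positive, significant slope links higher scores to success
```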
Reliability
While validity is meant to describe the presence of systematic error in the data, reliability describes the presence of random error, or the error that can be attributed to fluctuations that naturally occur in responding to a set of items. A common method for reporting evidence of reliability is to report a value for Cronbach's alpha. Cronbach's alpha relies on an assumption that a set of items belongs to a common construct; conducting a confirmatory factor analysis modeling each item onto a common construct and examining the goodness-of-fit indices can investigate the extent to which this assumption is tenable (Komperda et al., 2018). Cronbach's alpha also relies on an assumption that each item has equal weighting in determining the construct. More recently, CER studies have begun to use McDonald's omega to describe the consistency across items without the assumption of equal weighting (Komperda et al., 2018). Alternatively, Rasch modeling can be used to determine consistency across items with unequal weighting (see He et al., 2016 for an example). Higher alpha or omega values provide evidence of lower random error across the set of items. Some misapplications are the reporting of a satisfactory Cronbach's alpha value while analyzing scores for each item, or reporting a single Cronbach's alpha value for an instrument that is designed to measure multiple constructs (Taber, 2018b). Authors are encouraged to report alpha or omega values when reporting data from a single administration of an instrument. Alternatively, reliability can be investigated via test–retest reliability when multiple administrations of an instrument with the same population are employed, to explore variations across time (Komperda et al., 2018).
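For readers less familiar with these indices, the sketch below computes Cronbach's alpha directly from its definition for a set of items intended to measure a single construct. The simulated responses are illustrative only; McDonald's omega would instead be computed from the loadings of a one-factor model (see Komperda et al., 2018).

```python
# Minimal sketch of Cronbach's alpha from its definition (illustrative data only).
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Simulated responses from 200 students to five items measuring one construct
rng = np.random.default_rng(42)
latent = rng.normal(size=200)
items = pd.DataFrame(
    {f"item_{i}": latent + rng.normal(scale=1.0, size=200) for i in range(1, 6)}
)

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```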
Considerations and resources
Validity is a judgment process, which means there is no universally accepted decision on what makes a set of data valid for an intended use. As such, it is not possible to provide a prescriptive list of what is required to demonstrate sufficient evidence of validity. The types of evidence reviewed above do not represent a list of requirements. Instead, they are meant to help authors organize their evidence for validity, which can make it easier for reviewers and readers to ascertain that evidence. Central to the judgment decision is the intended use of the data obtained by the instrument. In the eyes of the research community, an instrument that is proposed to inform admissions to a program may require a higher bar of evidence than an instrument that is meant to test a theorized relationship.
Authors who develop a new instrument are laying the foundation of evidence for validity and, as such, may be expected to present more evidence of validity than authors using an established instrument. For example, a new instrument for which only the internal structure has been demonstrated may leave reviewers and readers with a concern over other types of evidence, depending on the proposed use. For a project that uses a previously published instrument, evidence can be compiled both from the data collected at the local site and from evidence presented in the literature on past data collected with the instrument. When relying on evidence for validity from the literature, it is helpful to detail the similarities and differences in research settings and instrument administration between the current study and the study in the literature. Because studies can rely on evidence from past work, current studies can serve future work by describing who the participants are (e.g., first-year post-secondary chemistry students) and the conditions for administration, such as whether the instrument was administered in person or online, the use of incentives, and the use of a time limit. Recently, an effort has been made to develop a resource that indexes instruments and the evidence of validity presented in CER studies (Barbera et al., 2022). This resource can serve researchers both in identifying instruments to use and in finding past validity evidence that is available to incorporate.
Ultimately, the quality and transparency of evidence for validity play a central role in evaluating the appropriate uses of the data collected. Researchers who analyze quantitative data are therefore encouraged to incorporate an explicit account of the evidence for validity of the data collected.
Acknowledgements
The author would like to thank Jack Barbera, Gwendolyn Lawrie, James Nyachwaya, and Ajda Kahveci for their thoughtful comments on a draft of this paper.
References
- Ardura D. and Perez-Bitrián A., (2018), The effect of motivation on the choice of chemistry in secondary schools: adaptation and validation of the Science Motivation Questionnaire II to Spanish students, Chem. Educ. Res. Pract., 19, 905–918.
- Arjoon J. A., Xu X. Y. and Lewis J. E., (2013), Understanding the State of the Art for Measurement in Chemistry Education Research: Examining the Psychometric Evidence, J. Chem. Educ., 90, 536–545.
- Barbera J., Harshman J. and Komperda R. (ed.), (2022), The Chemistry Instrument Review and Assessment Library, https://chiral.chemedx.org, accessed July 29, 2022.
- Cooper M. M., Underwood S. M. and Hilley C. Z., (2012), Development and validation of the implicit information from Lewis structures instrument (IILSI): do students connect structures with properties?, Chem. Educ. Res. Pract., 13, 195–200.
- Costello A. B. and Osborne J., (2005), Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis, Pract. Assess. Res. Eval., 10, 7.
- Danczak S. M., Thompson C. D. and Overton T. L., (2020), Development and validation of an instrument to measure undergraduate chemistry students’ critical thinking skills, Chem. Educ. Res. Pract., 21, 62–78.
- Deng J. M., Streja N. and Flynn A. B., (2021), Response Process Validity Evidence in Chemistry Education Research, J. Chem. Educ., 98, 3656–3666.
- He P., Liu X., Zheng C. and Jia M., (2016), Using Rasch measurement to validate an instrument for measuring the quality of classroom teaching in secondary chemistry lessons, Chem. Educ. Res. Pract., 17, 381–393.
- He P., Zheng C. and Li T., (2021), Development and validation of an instrument for measuring Chinese chemistry teachers’ perceptions of pedagogical content knowledge for teaching chemistry core competencies, Chem. Educ. Res. Pract., 22, 513–531.
- Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, (2011), Standards for educational and psychological testing, Washington, DC: American Educational Research Association.
- Komperda R., Pentecost T. C. and Barbera J., (2018), Moving beyond Alpha: A Primer on Alternative Sources of Single-Administration Reliability Evidence for Quantitative Chemistry Education Research, J. Chem. Educ., 95, 1477–1491.
- Komperda R., Hosbein K. N., Phillips M. M. and Barbera J., (2020), Investigation of evidence for the internal structure of a modified science motivation questionnaire II (mSMQ II): a failed attempt to improve instrument functioning across course, subject, and wording variants, Chem. Educ. Res. Pract., 21, 893–907.
- Lewis S. E., (2014), Examining Evidence for External and Consequential Validity of the First Term General Chemistry Exam from the ACS Examinations Institute, J. Chem. Educ., 91, 793–799.
- Lewis S. E., Shaw J. L. and Freeman K. A., (2011), Establishing open-ended assessments: investigating the validity of creative exercises, Chem. Educ. Res. Pract., 12, 158–166.
- Montes L. H., Ferreira R. A. and Rodríguez C., (2022), The attitude to learning chemistry instrument (ALChI): linking sex, achievement, and attitudes, Chem. Educ. Res. Pract., 23, 686–697.
- Özalp D. and Kahveci A., (2015), Diagnostic assessment of student misconceptions about the particulate nature of matter from ontological perspective, Chem. Educ. Res. Pract., 16, 619–639.
- Rocabado G. A., Komperda R., Lewis J. E. and Barbera J., (2020), Addressing diversity and inclusion through group comparisons: a primer on measurement invariance testing, Chem. Educ. Res. Pract., 21, 969–988.
- Schönborn K. J., Höst G. E. and Lundin Palmerius K. E., (2015), Measuring understanding of nanoscience and nanotechnology: development and validation of the nano-knowledge instrument (NanoKI), Chem. Educ. Res. Pract., 16, 346–354.
- Taber K. S., (2018a), Lost and found in translation: guidelines for reporting research data in an ‘other’ language, Chem. Educ. Res. Pract., 19, 646–652.
- Taber K. S., (2018b), The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education, Res. Sci. Educ., 48, 1273–1296.