Joseph A. Harsh
James Madison University, Department of Biology, 951 Carrier Drive, MSC 7801, Harrisonburg, VA 22807, USA. E-mail: harshja@jmu.edu
First published on 17th May 2016
Undergraduate research (UR) is a vetted educational tool that is commonly perceived to prepare students for entering graduate school and careers in STEM fields; however, scholarly attention to date has largely relied on self-report data, which may limit inferences about the causal effects on student outcomes. In light of this, recent calls have been made for innovative and rigorous assessment strategies to better understand the efficacy and impact of UR on key disciplinary skills, in both classroom and internship UR models, that can help inform decisions about educational refinement. To more accurately measure the effect of UR on students, well-designed performance-based assessments can be used to provide direct evidence of the development of targeted skills during the research experience. Given the limited availability of tested, adaptable (and freely available) performance measures for assessing undergraduate chemistry students' scientific thinking skills, this article outlines a five-step process, drawn from the literature, by which faculty interested in assessing the effect of research training in the lab and classroom can develop reliable tasks and rubrics. For this purpose, as an applied example, the manuscript describes the development, testing, and validation of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, which was designed to directly characterize the effects of research experiences on chemistry students' analytical and data-driven decision-making through open-response tasks situated in real-world scientific problems. Initial results reveal that the PURE instrument has high face validity and good reliability in measuring the scientific thinking skills of chemistry student researchers, and it documents differences in UR students' answer quality over time, supporting the effect of UR on research skill growth and the viability of performance data for assessing these changes.
One way to provide rigorous evidence of the efficacy and impact of UR is through performance data that directly measure student progress. In general, performance-based assessment (PBA) requires the application of knowledge in the construction of an original response to authentic tasks valued in their own right (Linn et al., 1991; Zoller, 2001). By providing individuals the opportunity to demonstrate their knowledge when presented with a “real-world” challenge, direct assessments of performance are often assumed to have higher validity than indirect measures (Feldon et al., 2010). However, despite calls for generalized evaluations to accurately gauge the effectiveness of UR (Linn et al., 2015), few performance instruments have been designed and tested (or made freely available) to gauge student progress in disciplinary research competencies.† The intent of this article is to provide a design framework, drawn from the literature, for chemistry faculty interested in creating performance tasks to assess the effect of classroom and research practices on student learning. For this purpose, as an applied example, the manuscript outlines the development, testing, and validation of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, which was designed to directly characterize the effects of research experiences on chemistry students' analytical and data-driven decision-making through open-response tasks situated in real-world scientific problems.
Fig. 1 Coherent instrument development as mediated by the four-cornerstone model (adapted from Brown and Wilson, 2011). Italicized text represents elements of the PURE instrument as they relate to the general (non-italicized) framework. Dashed lines indicate the relationships between cornerstones.
Similar to preexisting performance instruments (e.g., Stein et al., 2007; Gormally et al., 2012; Shadle et al., 2012) and as outlined by Towns (2010), the PURE was designed, tested, and validated through an iterative process informed by the cognitive science and assessment literature, recent relevant PBAs, feedback from experts in chemistry and science education, field testing, and student feedback. The sections below provide a brief description of the stepwise design process, which is summarized in Table 1. Permission was obtained from the Indiana University (Indiana, USA) Institutional Review Board (IRB), and faculty and students were provided a study information sheet detailing the project prior to voluntary completion of data collection.
Process step | Actions
---|---
1. Established expert panel | • Recruited a panel of volunteer experts in chemistry and pedagogy to provide feedback guiding instrument development
2. Identified target skills | • Reviewed the literature to identify core problem-solving and quantitative literacy skills • Solicited feedback from expert faculty to select the skills to be included and to describe what competencies “look like” at different skill levels to establish construct validity
3. Developed instrument (tasks and scoring rubric) | • Developed tasks and rubrics based largely on iterative feedback from the expert panel • Examined existing validated instruments to inform development and testing • Surveyed external faculty regarding their perceptions of the instrument to evaluate content validity • Administered the pilot test in multiple rounds and conducted short cognitive interviews with students • Revised the initial instrument based on expert-panel input, item analyses, and feedback from students and external faculty
4. Collected evidence of instrument validity, reliability, and practicality in measuring skill changes | • Administered the E-PURE to students at multiple institutions near the beginning and end of the URE • Surveyed students to collect information on research activities • Collected feedback from students and faculty mentors using questionnaires to examine test perceptions and difficulties
5. Evaluated instrument reliability, validity, and practicality | • Examined students' and faculty mentors' perceptions of the test (validity and practicality) • Evaluated instrument reliability • Analyzed student performance over the URE quantitatively and qualitatively • Revised the E-PURE based on item analyses and feedback from students and faculty mentors
 | Target skills | Explanation of skill | Number of items used to test skill
---|---|---|---
 | Understand methods of inquiry that lead to scientific knowledge | |
1 | Locate information relevant to a chemistry problem | Identify databases to search for information in the field relevant to a problem | 1
2 | Design a research study to test a scientific question | Develop a study that identifies relevant research factors and collects data to effectively examine the problem | 3
3 | Apply (or know when to apply) appropriate analytical methods to examine a chemical problem | Identify and use (or know when to use) the “best” practices of the field (i.e. research techniques/instrumentation) | 3
4 | Appraise a research design to identify elements and limitations and how they impact scientific findings/conclusions | Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) | 2
5 | Troubleshoot technical issues | Evaluate a scientific problem to identify possible technical causes | 2
 | Evaluate scientific information | |
6 | Evaluate evidence and critique experimental designs | Understand the limits of correlational data and experimental design elements | 3
7 | Identify additional information needed to evaluate a hypothesis/interpretation | Explain how new information may contribute to the evaluation of a problem | 5
8 | Provide alternative explanations for results that may have many causes | Recognize and explain possible alternative interpretations for given data or observations | 1
 | Interpret, represent, and analyze quantitative scientific data | |
9 | Represent data in a visual form | Convert relevant information into an appropriate visual representation given the type of data | 4
10 | Interpret visual representations of data | Interpret or explain information presented in visual forms | 2
11 | Interpret basic statistics | Understand the need for basic statistics to quantify uncertainty in data or draw conclusions | 1
 | Design feature
---|---
A | Authenticity in terms of the application of knowledge to real-world problems that often have several “best” answers (Raker and Towns, 2012; Stein et al., 2007)
B | Aptly challenging students by requiring them to transfer familiar concepts and practices to unfamiliar situations (Quellmalz, 1991)
C | Using cognitively complex tasks that are probing, open-ended, and multipart to assess student performance and encourage students to provide complete answers that can be easily differentiated upon evaluation (Shadle et al., 2012)
D | Situating problems in contexts that are purposeful and meaningful to the learner (Quellmalz, 1991)
E | Generalizable across contexts and approachable to chemistry students of all academic stages (Holme et al., 2010; Linn et al., 1991)
F | Of practical utility in terms of the time required and the clear communication of expectations (Linn et al., 1991)
To require the use of higher-order cognitive processes and maximize differences in evaluating responses, items were largely structured in a multipart, open-ended, short-essay response design that focused on problem solving as well as reasoning ability. In a two-step design, where each step was independently evaluated, students were often prompted to first (a) provide a “best” response to a given real-world scenario where there may be multiple “correct” answers, and then (b) offer a rationale regarding why they chose to respond in such a manner. Stein et al. (2007) argue that the particular advantage of such an open design is that “many authentic real-world situations requiring critical thinking and problem solving do not have a simple answer or a simple set of alternatives from which to choose the best answer”. See Box 1 for a sample question.
Given the intent of the instrument to measure the development of CSTSs for URE students across the field of chemistry, a high degree of generalizability in items focusing on concepts and techniques was required (Holme et al., 2010). To account for the fact that technique-based questions may be less generalizable (Shavelson et al., 1991), faculty colleagues in chemistry (n = 18) from multiple institutions were surveyed to identify the research activities their students had commonly experienced over the prior five years. We used this information to guide task design and improve validity. Even with this information, given our prior experiences with STEM UREs (as researchers, mentors, and former participants), it was assumed that not all students, even those working in the same lab, would have comparable exposure to research-related activities due to project focus and research background. Therefore, we designed multiple tasks for given skills to cover different practices that students may encounter in their research. This approach is consistent with recommendations that a higher degree of generalizability can be obtained by increasing the number and format of performance tasks (Linn et al., 1991).
Based on the design features in Table 3, the tasks were situated in real-world scientific problems, including: (1) the potential effect of Atrazine [a common pesticide] on the sexual development of amphibians, and (2) the antimalarial properties of compounds found in medicinal plants (e.g., Hayes et al., 2002; Ibrahim et al., 2012). The scenarios were chosen based on their social relevance‡ (e.g., water quality, public health), generalizability, and the availability of scientific literature to inform the tasks, scoring criteria, and background details. As prior literature in cognition emphasizes that knowledge practiced in one situation may not necessarily transfer to a new, unfamiliar situation (Bransford and Schwartz, 1999), a case study approach was used to provide relevant background information with the intent of advancing students' familiarity with the “real-world” problems (e.g., topical importance, terminology) and thereby facilitating their transfer of general EPS and QL skills to the tasks. Where possible, information (e.g., methods, figures) from the research literature was also used to increase task authenticity and engage student interest. To further evaluate task validity, field testing was conducted with students (n = 7) at a liberal arts college and a public research university. Student responses were evaluated for general quality and agreement with the intent of the item, and this information was incorporated into the design process. After completion, students were asked to provide feedback on the test through short (<10 minutes) semi-structured cognitive interviews focused on any difficulties they had in responding to the tasks as well as their interest in the situating topics and the test overall.
These general traits aligned well with the intent to capture information about students' cognitive strategies. As an example, Box 2 includes the associated scoring criteria for the task above (Box 1), which targets students' ability to identify and explain sources of experimental error in a proposed study. Criterion One focuses on the plausibility and relevance of the identified errors to the proposed study design; Criterion Two focuses on the rationale for the potential implications of the error for the study. As seen in Box 2, multiple criteria were employed to gauge the extent of students' problem-solving skills in the multipart questions. Given the nature of each question, criteria were designed to be either holistic (focused on a general level of response quality) or analytical (focused on discrete response elements) for scoring purposes. The levels of proficiency and the associated scoring weightings, which reflect outcome priority, were based on existing rubrics (Shadle et al., 2012; Harsh et al., 2013) and the collective test-writing experience of the author and expert panel.
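To make the weighting scheme concrete, the sketch below shows one way such a multi-criterion rubric could be encoded and applied when scoring a response. It is a minimal illustration in Python; the criterion names, levels, and weights are hypothetical and are not the PURE scoring guide itself.

```python
# Hypothetical rubric for one multipart task: criterion names, levels, and
# weights are illustrative only (not the actual PURE scoring guide).
RUBRIC = {
    "identify_error": {   # analytical criterion: discrete response elements
        "weight": 2,
        "levels": {0: "no relevant error", 1: "plausible error", 2: "plausible, design-specific error"},
    },
    "explain_impact": {   # holistic criterion: overall quality of the rationale
        "weight": 1,
        "levels": {0: "none", 1: "basic", 2: "adequate", 3: "sophisticated"},
    },
}

def score_response(ratings):
    """Weighted sum of the levels assigned to one response across all criteria."""
    return sum(RUBRIC[criterion]["weight"] * level for criterion, level in ratings.items())

# Example: a plausible, design-specific error (level 2) with a basic rationale (level 1)
print(score_response({"identify_error": 2, "explain_impact": 1}))  # -> 5
```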
Determination of response “correctness” was based on three sources of information: (a) the research literature from which the problems were initially drawn, which was used to establish validity and identify the “best” responses; (b) well-substantiated secondary sources (e.g., governmental websites), which were reviewed to expand the potential range of acceptable responses; and (c) answers provided by two chemistry faculty not affiliated with the study, which were used to further develop the scoring criteria. As example responses increase the reliability of rubric scoring (Linn et al., 1991), “exemplars” were created from the different sources of information and included in the scoring guide. Again, multiple rounds of iterative revision based on feedback from the expert panel were used to establish validity and to refine each criterion for scientific accuracy, utility, and assessment appropriateness.
The final instrument consisted of 16 multipart questions (30 total tasks) and was designed to take students 60 minutes to complete. To permit students from across the U.S. to complete the assessment at their convenience, the same PURE test was administered online, using surveys hosted on Qualtrics, near the beginning and end of their URE. Students were instructed to answer the questions independently without outside resources. To submit hand-drawn graphical data representations for two QL tasks, students were asked to send the images electronically (i.e. camera/phone, computer scanner). Additionally, a short set of questions was included to capture student views of the instrument (e.g., item clarity, topical interest) as well as relevant experiences that may have influenced their performance (e.g., familiarity with the situating topics, prior technical practices).
Information on students' educational and demographic background as well as their research activities and perceived ability in various skills was provided from survey data collected in the larger project. The short online survey (∼20 minutes) was administered pre- and post-URE with the purpose of evaluating participatory outcomes and activities that might affect student outcomes. Survey data were handled and analyzed in IBM SPSS (v.20).
Of the 54 students solicited, 34 (63% response rate) provided pre-URE data, and 24 completed the partnered post-test (74% retention rate). Data from all participants were used for the testing of instrument reliability and validity as well as to provide the broadest range of student perceptions regarding the assessment (e.g., topical familiarity). The sample comprised relatively equal distributions of gender (41% women), academic stage (51% 1st/2nd year students), research experience (50% first-time URE participants), institutional type (56% conducting research at a liberal arts institution), and postgraduate plans (65% STEM career, 32% non-STEM career, and 3% undecided). Student backgrounds were comparable across the groups that did and did not complete the posttest, as were their respective pretest scores, which suggests comparable proficiencies and that attrition was not linked to poor initial test performance.
Second, Cronbach's α was calculated to evaluate the internal consistency of the tasks in probing the same construct (i.e. how closely related the items are as a group in measuring CSTS). General guidelines (Hair et al., 2006) suggest that “acceptable” α values range from 0.6 to 1.0 depending on test type (e.g., multiple choice vs. open response) and sample size, with higher values indicating greater consistency between items. The pre- and posttest consistency estimates were 0.643 and 0.748, respectively, which are comparable to other similar existing instruments and considered good for this type of test and sample size (Stein et al., 2007). The reliability of the larger target skill categories in Table 2 (e.g., understanding methods of scientific inquiry) ranged from α = 0.432 to 0.759, which is considered acceptable for tests with small sample sizes (values >0.4 are proposed to be adequate if n < 30 (Hair et al., 2006)). As α can be deflated by a narrow range of items, it is recommended that multiple items per target skill be employed to appropriately assess student abilities. Third, pre- and posttest scores were found to be normally distributed with no floor or ceiling effects, and the group means were mid-range (pretest M = 70.9 and posttest M = 91.9 of a maximum 146 points), suggesting the performance tasks targeted an appropriate difficulty level for these students.
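For readers who want to run the same internal-consistency check on their own item-level data, a minimal sketch of Cronbach's α is given below (Python/NumPy); the score matrix is a hypothetical placeholder, not the PURE data set.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_students x n_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 5 students x 4 rubric-scored items
scores = [[2, 3, 1, 2],
          [3, 3, 2, 3],
          [1, 2, 1, 1],
          [2, 2, 2, 2],
          [3, 4, 3, 3]]
print(round(cronbach_alpha(scores), 3))
```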
For test reliability, students were also directed to complete the test independently without the use of external resources. While it is not possible to confirm whether a student consulted external sources from this information alone, it was assumed that time-on-task could act as a useful guide for assessing response credibility. The relationship between time-on-test and test score was found to be non-significant (r = 0.123), which supports instrument reliability in estimating student skills. In other words, the amount of time spent completing a task was not found to influence observed student performance. Further, qualitative evaluation of student responses with extended time-on-task (17% of all responses) revealed no instances where students benefited from longer completion times that may have afforded the opportunity to consult external resources. In contrast, students who took extended periods to complete tasks tended to give less sophisticated responses, possibly reflecting the cognitive costs (Monsell, 2003) of switching between the test and other non-test activities (e.g., using social media). While these results appear to support the practicality and reliability of the instrument, written feedback from a subset of students (30%) reported concerns about the amount of time required to complete the tests. Thus, in an effort to prevent test fatigue and maintain student motivation in future performance evaluations, either a shorter test focused on a smaller set of skills or multiple short assessments would be recommended over a single longer test.
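As a simple illustration of the time-on-task check described above, the sketch below computes the Pearson correlation between completion time and total score with SciPy; the two arrays are hypothetical placeholders for per-student data, not the study's values.

```python
from scipy.stats import pearsonr

# Hypothetical per-student data: minutes spent on the test and total test score
time_on_test_min = [42, 55, 61, 48, 90, 37, 70, 58]
total_score      = [78, 85, 70, 92, 74, 66, 88, 80]

r, p_value = pearsonr(time_on_test_min, total_score)
# A small, non-significant r would suggest completion time did not drive observed scores
print(f"r = {r:.3f}, p = {p_value:.3f}")
```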
Second, pairwise comparisons were made to assess group-level changes in student scores before and after the experience (Fig. 2).
Fig. 2 Comparison of students' (n = 24) total scores on the PURE instrument by test administration. Error bars represent 95% CIs around the means.
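A group-level pre/post comparison like the one shown in Fig. 2 could be analyzed with a paired test on matched students; the sketch below assumes a paired t-test with a non-parametric Wilcoxon check (the specific test used in the study is not named in this excerpt), and the score arrays are hypothetical.

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical matched pre- and post-URE total scores for the same students
pre_scores  = [62, 70, 55, 80, 66, 73, 59, 77]
post_scores = [75, 88, 70, 95, 82, 90, 71, 96]

t_stat, p_t = ttest_rel(post_scores, pre_scores)   # parametric paired comparison
w_stat, p_w = wilcoxon(post_scores, pre_scores)    # non-parametric alternative for small samples
print(f"paired t-test: t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_w:.4f}")
```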
For PBAs, differentiation in answer quality is essential to ensuring the validity and reliability of tasks and scoring criteria (Linn et al., 1991). Further, subtle differences were quantitatively and qualitatively observed in the student data, which supports the utility of the instrument in providing reliable insight into, and evidence of, the effect of UREs on the development of complex scientific thinking skills.
Box 1. Representative question from the case study focusing on the effects of the pesticide Atrazine on the sexual development of male frogs

In a literature search, you find a 2003 study by Hayes and colleagues on the effects of atrazine-induced hermaphroditism in male American Leopard frogs (Rana pipiens).¶ Leopard frogs and water samples were collected from a variety of natural habitats in areas with reported low and high use of atrazine in the Midwest USA. The water samples collected at each site were examined for atrazine contamination levels. Below is a modified excerpt from the Hayes et al. study regarding the initial chemical analysis:

At each site, we collected water (100 mL) in clean chemical-free glass jars for chemical analysis. Water samples were kept at room temperature (20–25 °C) until analysis. Duplicate samples were analyzed for atrazine levels from all sites by the Hygienic Laboratory (University of Iowa, Iowa City) and by PTRL West Inc. Research Lab (Hercules, CA). Water samples were extracted in organic solvent, followed by aqueous/organic extraction. The samples were comparably prepared at both labs with reagents of similar purities. Samples (between 3 and 4.25 mL) were analyzed by liquid chromatography/mass spectrometry (LC-MS), and compared to a standard reference. Both analytical laboratories were aware of the collection localities and shared their findings.

Question 3: In the modified chemical analysis excerpt above, please identify and describe all potential factors in the methodology and analysis that you feel could contribute to variation in the reported atrazine levels between and/or within sites. (Please explain why you identified these factors.)
Box 2. Scoring criteria associated with the task in Box 1 (identifying and explaining sources of experimental error in a proposed study)
Box 3. Representative answers from a student (Student A, new URE participant, junior) explaining a strategy to test for the presence and concentration of atrazine in water samples

“I would use tablets that changed the color of a solution depending on the concentration of the chemical in a sample. I would use this because I could compare the concentrations of atrazine in many different samples.” (Pre-response, Level 1 – Basic)

“A GC/MS would be best used to identify and measure the atrazine in the water. The MS would verify its identity, and the GC could help you determine in what concentrations the chemical is present.” (Post-response, Level 3 – Adequate)
Faculty who want to create their own assessments can use the design process detailed here to inform their instruments. The process (summarized in Table 1) outlines five steps that can guide faculty development and testing of performance-based probes. These general (and modifiable) steps include: consultation of expert faculty in chemistry and education throughout the design process to establish validity; identification of the target skill areas of interest for assessment; development of effective performance tasks that are carefully mapped or aligned with scoring criteria permitting the assessment of changes in skills over time; and the collection and analysis of qualitative and quantitative evidence of validity and reliability in testing the desired target skills. Through the use of performance data to identify skill difficulties, interventions can be reformulated or refined to enhance student learning.
Despite the apparent merit of PBAs in stimulating student thought and providing direct evidence of what students can (or cannot) “do”, there are three noteworthy limitations of the PURE instrument described here. First, the small sample size (n = 24), while generally considered appropriate for a pilot study, limits the rigorous psychometric and quantitative analyses that are desirable in the iterative process of assessment building as well as in making inferences about students' performance. Future research will be conducted with a larger sample to inform revisions of the pilot PURE instrument into a more psychometrically sound measure of EPS and QL skills in advance of broader use in UR assessment. On balance, this study provides initial evidence of the viability of generalizable performance instruments for assessing student research skills. Second, the design, testing, and scoring of the PURE (and other performance instruments) are time intensive, which may dissuade faculty from using this type of assessment. However, it can be contended that the tradeoff of providing direct evidence of student performance in complex skill domains, such as problem solving, may outweigh the initial “costs” in time for PBA development and practice in implementing scoring rubrics. Further, existing vetted instruments can be adopted (or adapted) to meet the needs of faculty in assessing student competencies, helping to decrease the amount of upfront time expended in the design process (Towns, 2010). Third, while substantial efforts were taken to design a generalizable instrument, it is reasonable to expect some unevenness in students' topical (i.e. Atrazine and Malaria case studies) or technical familiarity, which may favor students with prior exposure. Here, however, comparisons were not drawn between students, as research skills were measured on an individual basis to gauge the contributions of URE participation to learning. In other words, students acted as their own “control” group because comparisons were drawn between one's baseline and post-URE skill performance.
Moving forward, research on student engagement in research experiences would be strengthened by complementing self-report data on one's competencies with direct documentation of skill progress. Performance evidence of student progress should prove useful to faculty in refining UR by identifying strengths and gaps in student proficiencies. As this exploratory study demonstrates the usefulness of performance data in providing reliable insight into the effect of UR on chemistry students' skill development, we encourage others to consider developing this type of assessment to meaningfully evaluate the effectiveness of teaching activities and educational programs in support of student learning.
Footnotes
† See Towns (2010) for a review of cognitive, affective, and psychomotor measures used in chemistry. |
‡ Student survey feedback later indicated that 71% and 65% of students, respectively, found the situating frog hermaphroditism and medicinal plant topics to be of interest. |
§ It is worth mentioning that the intended population(s) for a measure must be taken into account when designing or using an instrument, as the properties of validity and reliability are established for the population on which it is tested (AERA/APA/NCME, 2014).
¶ See Harsh, Esteb, and Maltese (in preparation) for a more extended description of URE participant learning progressions as measured using the PURE. |