Designing performance-based measures to assess the scientific thinking skills of chemistry undergraduate researchers

Joseph A. Harsh
James Madison University, Department of Biology, 951 Carrier Drive, Msc 7801, Harrisonburg, Va 22807, USA. E-mail: harshja@jmu.edu

Received 1st March 2016, Accepted 17th May 2016

First published on 17th May 2016


Abstract

Undergraduate research (UR) is a vetted educational tool that is commonly perceived to prepare students for entering graduate school and careers in STEM fields; however, scholarly attention to date has largely relied on self-report data, which may limit inferences about the causal effects on student outcomes. In light of this, recent calls have been made for innovative and rigorous assessment strategies to better understand the efficacy and impact of UR on key disciplinary skills, both in classroom and internship UR models, that can help inform decisions about educational refinement. To more accurately measure the effect of UR on students, well-designed performance-based assessments can be used to provide direct evidence to the development of targeted skills during their research experience. Given the limited availability of tested, adaptable (and freely available) performance measures for assessing undergraduate chemistry students' scientific thinking skills, this article outlines a five-step process drawn from the literature about how reliable tasks and rubrics can be developed by faculty interested in assessing the effect of research training in the lab and classroom. For this purpose, as an applied example, the manuscript describes the development, testing, and validation of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, which was designed to directly characterize the effects of research experiences on chemistry students' analytical and data-driven decision-making through open-response tasks situated in real-world scientific problems. Initial results reveal that the PURE instrument has high face validity and good reliability in measuring the scientific thinking skills of chemistry student researchers, and documents differences in UR students' answer quality over time, supporting both the effect of UR on research skill growth and the viability of performance data for assessing these changes.


Introduction

Recent calls in science, technology, engineering, and mathematics (STEM) education regularly highlight engaging undergraduates in disciplinary research practices for the preparation of the scientific workforce and an educated citizenry (Boyer Commission, 2008; American Association for the Advancement of Science [AAAS], 2011; President's Council of Advisors on Science and Technology [PCAST], 2012). Motivated by this view, major efforts have been undertaken nationally by stakeholders to increase student access to undergraduate research (UR), including Course-based Undergraduate Research Experiences (CUREs) or standalone Undergraduate Research Experiences (UREs), for enculturation in field-based practices (e.g., communication, data collection and analysis) and the process of actively doing science. In support of these innovations, prior studies have outlined numerous cognitive, affective, psychological, and conative benefits to UR participants across science disciplines, as well as enhanced graduation and retention of students in STEM degree programs and careers (as reviewed in Laursen et al., 2010; Lopatto, 2010; Corwin et al., 2015). Even with such findings, most prior work is based on participant self-reports, which may limit assessment of the causal effects of these interventions on student outcomes (Crowe and Brakke, 2008; Linn et al., 2015). This is concerning, as faculty may be left asking "What are my students actually taking from the experience?", which is vital for informing decisions about programmatic improvement (Holme et al., 2010).

One way to provide rigorous evidence to the efficacy and impact of UR is through performance data that directly measures student progress. In general, performance-based assessment (PBA) requires the application of knowledge in the construction of an original response to authentic tasks valued in their own right (Linn et al., 1991; Zoller, 2001). By providing individuals the opportunity to demonstrate their knowledge when presented a “real-world” challenge, direct assessments of performance are often assumed to have higher validity than indirect measures (Feldon et al., 2010). However, despite calls for generalized evaluations to accurately gauge the effectiveness of UR (Linn et al., 2015), few performance instruments have been designed and tested (or made freely available) to gauge student progress in disciplinary research competencies. The intent of this article is to provide a framework for a design process drawn from the literature for chemistry faculty interested in creating performance tasks to assess the effect of classroom and research practices on student learning. For this purpose, as an applied example, the manuscript outlines the development, testing, and validation of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, which was designed to directly characterize the effects of research experiences on chemistry students' analytical and data-driven decision-making through open-response tasks situated in real-world scientific problems.

Development of the PURE instrument

Study rationale and general overview of the PURE development and testing

The PURE was piloted as part of a national study, titled Undergraduate Scientists: Measuring Outcomes of Research Experiences (US-MORE; NSF #1140445), designed to investigate the nature of UREs and the broad range of effects conferred on participants in the physical sciences. To complement self-report research (i.e. surveys, interviews), the PURE was developed to provide direct evidence to changes in chemistry students' problem solving over the duration of a URE. Instrument development was guided by the four-cornerstone cycle outlined by Brown et al. (2010) for building a coherent assessment (Fig. 1). These cornerstones include: tasks that elicit observable performance data, scoring guides to assign relative value to the performance data, a measurement model to aggregate student performances, and a model of cognition to understand how students represent knowledge and develop competence. Here, competence is distinguished as what one is capable of and performance as what one actually does in a real-world situation as a product of one's competence combined with individual (e.g. research activities, motivation) and system-based influences (e.g., provided mentorship) (Rethans et al., 2002). As the model of cognition is the interpretation of student competencies, a "blueprint" was developed with guidance from expert faculty by mapping out the performance skills to be assessed and how competencies manifest (i.e. "look like") at different skill levels.
Fig. 1 Coherent instrument development as mediated by the four-cornerstone model (adapted from Brown and Wilson, 2011). Italicized text represents elements of the PURE Instrument in how they relate to the general (non-italicized) framework. Dashed lines indicate the relationships between cornerstones.

Similar to preexisting performance instruments (e.g., Stein et al., 2007; Gormally et al., 2012; Shadle et al., 2012) and as outlined by Towns (2010), the PURE was designed, tested, and validated through an iterative process informed by cognitive science and assessment literature, recent relevant PBAs, feedback from experts in chemistry and science education, field testing, and student feedback. The sections below provide a brief description of the step-wise design process, which is summarized in Table 1. Permission was obtained from the Indiana University (Indiana, USA) Institutional Review Board (IRB), and faculty and students were provided a study information sheet detailing the project prior to the voluntary completion of data collection.

Table 1 Overview of PURE design and testing process
Process step Actions
1. Established expert panel • Recruited a panel of volunteer experts in chemistry and pedagogy to provide feedback in guiding instrument development
2. Identified target skills • Reviewed literature to identify core problem solving and quantitative literacy skills

• Solicited feedback from expert faculty to select the skills to be included and what competencies "look like" at different skill levels to establish construct validity

3. Developed instrument (tasks and scoring rubric) • Developed tasks and rubrics based largely on iterative feedback from expert panel

• Examined existing validated instruments to inform development and testing

• Conducted survey with external faculty regarding their perceptions of the instrument to evaluate content validity

• Administered the pilot test in multiple rounds, and conducted short cognitive interviews with students

• Revised initial instrument based on expert panel, item analyses, and feedback from students and external faculty

4. Collected evidence of instrument validity, reliability, and practicality in measuring skill changes • Administered the PURE to students at multiple institutions near the beginning and end of the URE

• Surveyed students to collect information on research activities

• Collected feedback from students and faculty mentors using questionnaires to examine test perceptions and difficulties

5. Evaluated instrument reliability, validity, and practicality • Examined students' and faculty mentors' perceptions of the test (validity and practicality)

• Evaluated instrument reliability

• Analyzed student performance over the URE quantitatively and qualitatively

• Revised the PURE based on item analyses and feedback from students and faculty mentors



Step 1. Expert panel guidance in instrument design. For PBAs, expert review is a central developmental feature to build validity – the extent to which a given instrument succeeds in assessing the particular competencies that it was intended to assess (Mehrens, 1992). Throughout design and testing of the PURE, opinions were iteratively pooled from a panel of chemistry experts (n = 5), varying in research interests and institutional type, to gain consensus on task design, alignment of tasks to target skills, and desired student performance in chemistry scientific thinking skills (CSTSs). Contributions were also sought from education experts (n = 2) regarding the pedagogical appropriateness of the instrument (tasks and scoring criteria). Recursive feedback was solicited from the expert panel throughout the design process, which continued as incremental changes were made to the instrument until a general level of consensus was reached (Dalkey, 1972). To compensate for potential weaknesses associated with a recursive design method (e.g., low response rates, compromised solutions; Hsu and Sanford, 2007), efforts were taken to minimize time requirements through task distribution and by explicitly outlining expectations.
Step 2. Identification of target skills. The first part of the design process was to identify competencies in quantitative literacy and experimental problem solving central to scientific thinking (see Table 2 for the 11 targeted skills). Here, experimental problem solving (EPS) includes the ability to "define a problem clearly, develop testable hypotheses, design and execute experiments, analyze data, and draw appropriate conclusions" (American Chemical Society [ACS], 2008, p. 1). The selected EPS skills are skills 1–8 in Table 2. Quantitative literacy (QL) is a habit of mind and the knowledge and skills necessary to effectively engage in daily and scientific quantitative situations (National Council on Education and the Disciplines [NCED], 2001). The targeted QL skills are skills 9–11 (Table 2). The EPS and QL skills were selected because they are recognized in the literature as core elements of scientific literacy (AAAS, 2011; Gormally et al., 2012), and reflect real-world traits and practices used by scientists (NCED, 2001; ACS, 2008) that often do not translate well to the learning environment (Cooper, 2007). Additionally, these skills are commonly identified as URE outcomes (Laursen et al., 2010) and were specifically focused on in the larger project, in which chemistry students regularly self-reported gains to varying degrees. Once a draft list had been developed, feedback was solicited, in several rounds of revisions, from the expert panel to ensure the appropriateness of the targeted skills.
Table 2 EPS and QL skills targeted by the PURE
Target skills Explanation of skill Number of items used to test skill
Understand methods of inquiry that lead to scientific knowledge
1 Locate information relevant to a chemistry problem. Identify databases to search for information in the field relevant to a problem 1
2 Design a research study to test a scientific question Develop a study that identifies relevant research factors, and collects data to effectively examine the problem. 3
3 Apply (or know when to apply) appropriate analytical methods to examine a chemical problem. Identify and use (or know when to use) the “best” practices of the field (i.e. research techniques/instrumentation) 3
4 Appraise a research design to identify elements and limitations and how they impact scientific findings/conclusions Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) 2
5 Troubleshoot technical issues Evaluate a scientific problem to identify possible technical causes 2
Evaluate scientific information
6 Evaluate evidence and critique experimental designs Understand the limits of correlational data and experimental design elements 3
7 Identify additional information needed to evaluate a hypothesis/interpretation Explain how new information may contribute to the evaluation of a problem 5
8 Provide alternative explanations for results that may have many causes Recognize and explain possible alternative interpretations for given data or observations 1
Interpret, represent, and analyze quantitative scientific data
9 Represent data in a visual form Convert relevant information into an appropriate visual representation given the type of data 4
10 Interpret visual representations of data Interpret or explain information presented in visual forms 2
11 Interpret basic statistics Understand the need of basic statistics to quantify uncertainty in data or draw conclusions 1


Step 3. Instrument development.
Task design. Performance assessments are often favored over indirect measures due to their ability to provide a realistic accounting of what the student "does" or "can do" and their capacity to promote student learning; however, criteria that define quality assessments must also be considered (Linn et al., 1991). Task design was therefore informed by the development processes of recent instruments (Stein et al., 2007; Timmerman et al., 2010; Gormally et al., 2012; Shadle et al., 2012) and key design features (Table 3) highlighted in the PBA literature.
Table 3 Summary of performance task design features used in developing the PURE
A Authenticity in terms of the application of knowledge to real-world problems that often have several “best” answers (Raker and Towns, 2012; Stein et al., 2007)
B Aptly challenging students by requiring them to transfer familiar concepts and practices to unfamiliar situations (Quellmalz, 1991)
C Using cognitively complex tasks to assess student performance that is probing, open-ended, and multipart to encourage students to provide complete answers that can be easily differentiated upon evaluation (Shadle et al., 2012)
D Situating problems in contexts that are purposeful and meaningful to the learner (Quellmalz, 1991)
E Generalizable across contexts and approachable to chemistry students of all academic stages (Holme et al., 2010; Linn et al., 1991)
F Of practical utility in terms of the time required and the clear communication of expectations (Linn et al., 1991)


To require the use of higher-order cognitive processes and maximize differences in evaluating responses, items were largely structured in a multipart, open-ended, short-essay response design that focused on problem solving as well as reasoning ability. In a two-step design, where each step was independently evaluated, students were often prompted to first (a) provide a “best” response to a given real-world scenario where there may be multiple “correct” answers, and then (b) offer a rationale regarding why they chose to respond in such a manner. Stein et al. (2007) argue that the particular advantage of such an open design is that “many authentic real-world situations requiring critical thinking and problem solving do not have a simple answer or a simple set of alternatives from which to choose the best answer”. See Box 1 for a sample question.

Given the intent of the instrument to measure the development of CSTSs for URE students across the field of chemistry, a high degree of generalizability in items focusing on concepts and techniques was required (Holme et al., 2010). To take into account that technique-based questions may have less generalizability (Shavelson et al., 1991), faculty colleagues in chemistry (n = 18) from multiple institutions were surveyed to identify the research activities their students had commonly experienced over the prior five years. We used this information to guide task design and improve validity. Even with this information, given our prior experiences with STEM UREs (as researchers, mentors, and former participants), it was assumed that not all students, even those working in the same lab, would have comparable exposure to research-related activities due to project focus and research background. Therefore, we designed multiple tasks for given skills to cover different practices that students may encounter in their research. This approach is consistent with recommendations that a higher degree of generalizability can be obtained by increasing the number and format of performance tasks (Linn et al., 1991).

Based on the design features in Table 3, the tasks were situated in real-world scientific problems, including: (1) the potential effect of Atrazine [a common pesticide] on the sexual development of amphibians, and (2) the antimalarial properties of compounds found in medicinal plants (e.g., Hayes et al., 2002; Ibrahim et al., 2012). The scenarios were chosen based on their social relevance (e.g., water quality, public health), generalizability, and availability of scientific literature to inform the tasks, scoring criteria, and background details. As prior literature in cognition emphasizes that knowledge practiced in one situation may not necessarily transfer to a new, unfamiliar situation (Bransford and Schwartz, 1999), a case study approach was used to provide relevant background information with the intent to advance students' familiarity with the "real-world" problems (e.g., topical importance, terminology) to help facilitate their transfer of general EPS and QL skills to the tasks. Where possible, information (e.g., methods, figures) from research literature was also used to increase task authenticity and engage student interest. To further evaluate task validity, field-testing was conducted with students (n = 7) at a liberal arts college and a public research university. Student responses were evaluated for general quality and agreement with the intent of the item, and the results informed the design process. After completion, students were asked to provide feedback regarding the test through short (<10 minutes) semi-structured cognitive interviews focused on any student difficulties in responding to the tasks as well as their interest in the situating topics and test overall.


Scoring guide development. Similar to the test items, the scoring guide (i.e. criteria and exemplar responses) was developed using recursive expert feedback and modeled after existing open-response format instruments focusing on thinking skills (Stein et al., 2007; Timmerman et al., 2010; Gormally et al., 2012; Shadle et al., 2012; Harsh et al., 2013). Particular attention was paid to three general traits for evaluating EPS questions, including the extent to which a student response: (1) identified the important or relevant features of the problem, (2) presented a complete justification or explanation for strategies formulated to solve a problem, and (3) provided an effective means to solve problems in the field (Shadle et al., 2012). In addition, a fourth trait, focusing on students' QL abilities to understand and use visual data and statistics to solve problems, was included.

These general traits aligned well with the intent to capture information about students' cognitive strategies. As an example, Box 2 includes the associated scoring criteria from the task above (Box 1) that targets students' ability to identify and explain sources of experimental error in a proposed study. Criterion One focuses on the plausibility and relevance of the provided errors to the proposed study design, and Criterion Two focuses on the rationale for the potential implications of the error for the study. As seen in Box 2, multiple criteria were employed to gauge the extent of students' problem-solving skills in the multipart questions. Given the nature of the question, criteria were designed to be either holistic (focused on a general level of response quality) or analytical (focused on discrete response elements) for scoring purposes. The proficiency levels and associated scoring weightings, which reflect outcome priority, were based on existing rubrics (Shadle et al., 2012; Harsh et al., 2013) and the collective test-writing experience of the author and expert panel.
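To illustrate how analytic criteria and scoring weightings of this kind can be operationalised, the sketch below shows one possible representation of a two-criterion rubric and the calculation of a weighted task score. The criterion names, levels, and weights are hypothetical illustrations and are not taken from the PURE scoring guide.

```python
# Minimal sketch of an analytic rubric with weighted criteria.
# Criterion names, maximum levels, and weights are hypothetical;
# they do not reproduce the actual PURE scoring guide.

RUBRIC = {
    "identifies_relevant_error_sources": {"max_level": 4, "weight": 2.0},
    "explains_implications_for_study":   {"max_level": 4, "weight": 1.0},
}

def weighted_task_score(ratings: dict) -> float:
    """Combine per-criterion ratings into a single weighted task score."""
    total = 0.0
    for criterion, spec in RUBRIC.items():
        level = ratings[criterion]
        if not 0 <= level <= spec["max_level"]:
            raise ValueError(f"Rating out of range for {criterion}: {level}")
        total += spec["weight"] * level
    return total

# Example: a response rated 3/4 on the first criterion and 2/4 on the second.
print(weighted_task_score({
    "identifies_relevant_error_sources": 3,
    "explains_implications_for_study": 2,
}))  # -> 8.0
```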

Determination of response "correctness" was based on three sources of information: (a) the research literature from which the problems were initially drawn, used to establish validity and identify the "best" responses; (b) well-substantiated secondary sources (e.g., governmental websites), reviewed to expand the potential range of acceptable responses; and (c) answers provided by two chemistry faculty not affiliated with the study, used to further develop the scoring criteria. As example responses increase the reliability of rubric scoring (Linn et al., 1991), "exemplars" were created from these different sources of information and included in the scoring guide. Again, multiple rounds of iterative revisions based on feedback from the expert panel were used to establish validity and revise each criterion for scientific accuracy, utility, and assessment appropriateness.

Step 4. Evidence of instrument validity, reliability, and generalizability.
Test administration. In Summer 2013, all chemistry student researchers (n = 54) at seven institutions from the larger national project were invited to participate in this study. The institutions provided breadth in institutional type (three liberal arts schools, one master's granting institution, and three research universities) and research experience (e.g., formal programs v. faculty-sponsored internships). Students at each institution were engaged in an 8- to 10-week intensive (∼40 hours per week) URE and were discouraged from activities that may detract from their full-time research “employment”. Student volunteers were offered a modest stipend in recognition of their time.

The final instrument consisted of 16 multipart questions (30 total tasks) and was designed to take students 60 minutes to complete. To permit students from across the U.S. to complete the assessment at their convenience, the same PURE test was administered online using surveys hosted on Qualtrics near the beginning and end of their URE. Students were instructed to answer the questions independently without outside resources. To submit hand-drawn graphical data representations for two QL tasks, students were asked to send the images electronically (i.e. camera/phone, computer scanner). Additionally, a short set of questions was included to capture student views of the instrument (e.g., item clarity, topical interest) as well as relevant experiences that may influence their performance (e.g., familiarity with situating topics, prior technical practices).

Information on students' educational and demographic background as well as their research activities and perceived ability in various skills was provided from survey data collected in the larger project. The short online survey (∼20 minutes) was administered pre- and post-URE with the purpose of evaluating participatory outcomes and activities that might affect student outcomes. Survey data were handled and analyzed in IBM SPSS (v.20).

Of the 54 students solicited, 34 students (63% response rate) provided pre-URE data, and 24 completed the paired posttest (74% retention rate). Data from all participants were used for the testing of instrument reliability and validity as well as to provide the broadest range of student perceptions regarding the assessment (e.g., topical familiarity). The sample comprised relatively equal distributions in gender (41% women), academic stage (51% 1st/2nd year students), research experience (50% first-time URE participants), institutional type (56% conducting research at a liberal arts institution), and postgraduate plans (65% STEM career, 32% non-STEM career, and 3% undecided). Student backgrounds were comparable across the groups that did and did not complete the posttest, as were their respective pretest scores, which suggests comparable proficiencies and that attrition was not linked to poor initial test performance.
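For readers wishing to run a comparable attrition check on their own data, the sketch below illustrates one reasonable approach: comparing the pretest totals of posttest completers and non-completers with a Mann-Whitney U test, which is robust for small, unequal groups. The file and column names are hypothetical, and the original analyses were conducted in SPSS, so this Python version is an assumption rather than a reproduction of the study's procedure.

```python
# Hedged sketch: check whether attrition is linked to pretest performance by
# comparing pretest totals for students who did vs. did not complete the posttest.
# File and column names ('pretest_total', 'completed_post') are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("pure_pretest_scores.csv")

completers = df.loc[df["completed_post"] == 1, "pretest_total"]
dropouts = df.loc[df["completed_post"] == 0, "pretest_total"]

# Mann-Whitney U is a reasonable choice for small, possibly non-normal groups.
u_stat, p_value = stats.mannwhitneyu(completers, dropouts, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```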


Faculty mentor feedback. To further assess test validity, a subset of research mentors (n = 16) from the larger project were provided a copy of the PURE for review. A short 18-item questionnaire was used to solicit feedback on the validity and utility of the assessment in measuring URE students' EPS and QL skills. Volunteer faculty, representing a range of interests in the field and institutional types, commented on test comprehensiveness and generalizability; item and criteria clarity and accuracy; and use of the instrument for assessment and teaching purposes.
Performance data scoring. Student performance data were evaluated by science faculty (n = 4) and science education researchers (n = 2) with graduate degrees in the sciences using methods comparable to preexisting instruments (Stein et al., 2007; Timmerman et al., 2010; Shadle et al., 2012; Harsh et al., 2013). In advance of scoring, data were stripped of identifiers (i.e. participant name, pre-/posttest) and raters received training on each task they would be evaluating. To increase reliability, a scoring guide with detailed criteria and exemplar responses at different proficiency levels was provided (Timmerman et al., 2010). A small number (3 to 4) of student responses were initially scored "out loud" by the raters to discuss how they would apply the rubric. Following this, the raters independently scored 2 to 3 further responses and compared rubric application by examining their scoring results. Any substantial scoring differences (±2 levels on a 4- or 5-level rubric) were discussed by the raters until there was a general level of consistency (±1 level) in rubric use (Shadle et al., 2012). If issues emerged with the scoring criteria during the training session, rater feedback was used to make refinements to the rubrics as necessary. Most commonly, these were small changes in phrasing to clarify expectations or to better allow differentiation in student responses. Once the raters felt comfortable in the consistent use of the scoring guide, the remaining student responses were independently scored by two raters over a set time period. When notable differences in rater scoring of the same response were identified (±2 levels), raters were asked to again individually review and rescore the response without referencing their prior ratings. Often, this led to one rater modifying his/her score due to a general oversight (e.g., lack of consistency in rubric use, inserting an incorrect value into the scoring template). If scoring differences spanning multiple scale values persisted, the scores were discussed by the raters with one rater occasionally changing their score to be consistent with the other. The final assigned score for each question was calculated by averaging the raters' scores (Stein et al., 2007). This process was repeated for each question on the test.
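The double-scoring workflow described above can be summarised in a short script; the sketch below (with a hypothetical file layout of one row per response and one column per rater) flags score pairs that differ by two or more rubric levels for re-scoring and averages the remaining pairs to produce final item scores. The study itself did not necessarily automate this step, so treat the code as an illustration of the logic rather than the procedure actually used.

```python
# Sketch of the double-scoring workflow: flag rater pairs differing by >= 2
# rubric levels for re-scoring, and average agreeing pairs into a final score.
# The file and column names ('rater1', 'rater2') are hypothetical.
import pandas as pd

scores = pd.read_csv("item3_scores.csv")  # hypothetical: one row per student response

diff = (scores["rater1"] - scores["rater2"]).abs()
needs_rescore = scores[diff >= 2]        # to be independently re-scored/discussed
agreed = scores[diff < 2].copy()         # within +/- 1 rubric level

agreed["final_score"] = agreed[["rater1", "rater2"]].mean(axis=1)

print(f"{len(needs_rescore)} responses flagged for re-scoring")
print(agreed[["rater1", "rater2", "final_score"]].head())
```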
Step 5. Evaluation of instrument reliability, validity, generalizability, and practicality. In assessing students' knowledge and judging competencies, it is crucial that the instrument used to assess that knowledge produces reliable and valid scores and associated interpretations for the intended population§ (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA/APA/NCME], 2014). Additionally, faculty and student perceptions of the PURE were obtained to ensure the instrument met the key design features of being generalizable and of practical utility (Linn et al., 1991).
Validity. The validity of the instrument as a measurement tool was again based on the authenticity of the problems drawn from research literature and evaluation by expert chemistry faculty across a range of institutions and disciplines. High levels of agreement in faculty feedback support the face validity of the questions. In addition, the assessment was found to have high content validity in covering the range of target skills that comprise scientific thinking in the field when evaluated by experts.
Reliability. Reliability – a measure of the reproducibility or consistency of scores – was characterized using multiple statistical analyses. First, intraclass correlation (ICC), a standard measure of interrater reliability, was used to estimate the degree of consistency among raters. In this assessment, 92% of variation in PURE scores was due to differences in the quality of student responses and only 8% was due to variation between trained raters. The average reliability across individual test items ranged from 0.76 to 0.95. These findings suggest a high level of scoring reliability (i.e. consistency between raters), which is often difficult to achieve with open-ended tests (Stein et al., 2007). It should be pointed out, however, that the common scoring practices used here (i.e. scorer training, consensus-building on discrepant scores, and the use of exemplars) enhance interrater reliability.
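Researchers wanting to carry out a similar interrater analysis can compute ICC estimates from long-format scoring data; a minimal sketch using the pingouin package is shown below, assuming hypothetical file and column names. The specific ICC model used in this study is not restated here, so the choice of the two-way consistency estimates (ICC3/ICC3k) is an assumption.

```python
# Hedged sketch: interrater reliability (ICC) from long-format scoring data.
# File and column names ('response_id', 'rater', 'score') are hypothetical.
import pandas as pd
import pingouin as pg

long_scores = pd.read_csv("pure_scores_long.csv")

icc = pg.intraclass_corr(
    data=long_scores,
    targets="response_id",  # each student response scored by two raters
    raters="rater",
    ratings="score",
)
# ICC3/ICC3k are the two-way mixed-effects consistency estimates.
print(icc[icc["Type"].isin(["ICC3", "ICC3k"])][["Type", "ICC", "CI95%"]])
```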

Second, Cronbach's α was calculated to evaluate the internal consistency of the tasks in probing the same construct (i.e. how closely related the items are as a group in measuring CSTS). General guidelines (Hair et al., 2006) suggest that "acceptable" α values range from 0.6 to 1.0 depending on test type (e.g., multiple choice v. open response) and sample size, with higher values indicating greater consistency between items. The pre- and posttest consistency estimates were 0.643 and 0.748, respectively, which are comparable to those of similar existing instruments and considered good for this type of test and sample size (Stein et al., 2007). The reliability of the larger target skill categories in Table 2 (e.g., understanding methods of scientific inquiry) ranged from α = 0.432 to 0.759, which is considered acceptable for tests with small sample sizes (values >0.4 are proposed to be adequate if n < 30 (Hair et al., 2006)). As α can be deflated with a narrow range of items, it is recommended that multiple items per target skill are employed to appropriately assess student abilities. Third, pre- and posttest scores were found to be normally distributed with no floor or ceiling effects, and the group means were mid-range (Pretest M = 70.9 and Posttest M = 91.9 of a maximum 146 points), suggesting the performance tasks targeted an appropriate difficulty level for these students.
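Cronbach's α can be computed directly from a students-by-items score matrix using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the total score); a minimal sketch with a hypothetical file layout is given below.

```python
# Hedged sketch: Cronbach's alpha from a students-by-items score matrix.
# The file layout (one row per student, one column per item) is hypothetical.
import pandas as pd

def cronbach_alpha(item_scores: pd.DataFrame) -> float:
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # per-item variance
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

items = pd.read_csv("pure_item_scores.csv")
print(f"alpha = {cronbach_alpha(items):.3f}")
```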


Generalizability. Student and mentor feedback was obtained about their perceptions of instrument generalizability. Mentor comments were especially positive about the appropriateness of the instrument's content coverage and cognitive complexity (not too difficult/simple) for the measurement of their students' EPS and QL skills. Similarly, most student respondents (97%) indicated that the questions aptly challenged their thinking. Approximately one-in-five students reported that they were unable to answer at least one question on the posttest due to a lack of familiarity with specific research techniques. While the inability of some students to answer all items can be viewed negatively, for the purpose of this study it afforded a comparison group of sorts that demonstrated (expectedly) little change in pre- and posttest responses – which is beneficial for establishing instrument validity. However, for faculty interested in using PBAs in the classroom, low generalizability should be avoided as it limits the reliable measurement of student knowledge/skills and can negatively affect student motivation.
Practicality. Test practicality – or the usefulness of a measure given issues relevant to administration (Linn et al., 1991) – was evaluated by time-on-task and response credibility. Average student test duration, recorded by the online survey software, was approximately 105 minutes (S.D. 82 minutes), compared with the intended completion time of 60 minutes. However, despite instructions to complete the online test in a single uninterrupted sitting, informal comments from several students revealed varying time durations spent "off-task" for assorted personal reasons. This calls into question the validity of the timing data, as none of the participants in the proctored field testing (n = 7) required more than 70 minutes to complete a longer pilot test.

To support test reliability, students were also directed to complete the test independently without the use of external resources. While it is not possible to confirm whether a student consulted external sources, it was assumed that time-on-task could act as a useful guide for assessing response credibility. The relationship between time-on-test and test score was found to be non-significant (r = 0.123) and supports instrument reliability in estimating student skills. In other words, the amount of time spent completing a task was not found to influence observed student performance. Further, qualitative evaluation of student responses with extended time-on-task (17% of all responses) revealed no instances where students benefited from longer completion times that may have afforded the opportunity to consult external resources. In contrast, students who took longer to complete tasks tended to provide less sophisticated responses, possibly reflecting cognitive costs (Monsell, 2003) in switching between test and other non-test activities (e.g., using social media). While these results appear to support the practicality and reliability of the instrument, written feedback from a subset of students (30%) expressed concern about the amount of time required to complete the tests. Thus, in an effort to prevent test fatigue and maintain student motivation in future performance evaluations, either a shorter test focused on a smaller set of skills or multiple short assessments would be recommended over a single longer test.
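The time-on-task check reported above amounts to a simple correlation between completion time and total score; a minimal sketch, assuming hypothetical column names and pre-cleaned timing data, is shown below.

```python
# Hedged sketch: correlation between completion time and total test score.
# File and column names ('minutes_on_test', 'total_score') are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("pure_posttest_with_timing.csv")

r, p = stats.pearsonr(df["minutes_on_test"], df["total_score"])
print(f"r = {r:.3f}, p = {p:.3f}")  # a weak, non-significant r would mirror the reported pattern
```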


Measurement of change. Differences in pre/post-response correctness, as assessed by the scoring criteria, provided a direct measure of individual changes in student competencies. To examine the longitudinal development of student skills, performance data were analyzed in two ways. First, descriptive statistics and qualitative examination of responses were used at the individual level to document students' baseline to post-URE differences for each item/skill (Box 3).

Second, pairwise comparisons were made to assess group-level changes in student scores before and after the experience (Fig. 2).


Fig. 2 Comparison of students' (n = 24) total scores on the PURE instrument by test administration. Error bars represent 95% CIs around the means.
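Group-level pre/post change of the kind shown in Fig. 2 is typically evaluated with a paired comparison on matched scores; the sketch below runs a paired t-test and, given the small sample, a Wilcoxon signed-rank test as a non-parametric check. The file and column names are hypothetical, and the specific test used in the original SPSS analysis is not restated here.

```python
# Hedged sketch: group-level pre/post comparison of matched PURE total scores.
# File and column names ('pre_total', 'post_total') are hypothetical.
import pandas as pd
from scipy import stats

matched = pd.read_csv("pure_matched_pre_post.csv")  # e.g., 24 matched pairs

t_stat, t_p = stats.ttest_rel(matched["pre_total"], matched["post_total"])
w_stat, w_p = stats.wilcoxon(matched["pre_total"], matched["post_total"])

print(f"Paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.3f}")
```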

For PBAs, differentiation in answer quality is essential to ensuring task and scoring criteria validity and reliability (Linn et al., 1991). Further, subtle differences were quantitatively and qualitatively observed in student data, which supports the utility of the instrument in providing reliable insight and evidence to the effect of UREs on the development of complex scientific thinking skills.


Box 1. Representative question from the case study focusing on the effects of the pesticide Atrazine on the sexual development of male frogs

In a literature search, you find a 2003 study by Hayes and colleagues on atrazine-induced hermaphroditism in male American leopard frogs (Rana pipiens). Leopard frogs and water samples were collected from a variety of natural habitats in areas with reported low and high use of atrazine in the Midwest USA. The water samples collected at each site were examined for atrazine contamination levels. Below is a modified excerpt from the Hayes et al. study regarding the initial chemical analysis:

At each site, we collected water (100 mL) in clean chemical-free glass jars for chemical analysis. Water samples were kept at room temperature (20–25 °C) until analysis. Duplicate samples were analyzed for atrazine levels from all sites by the Hygienic Laboratory (University of Iowa, Iowa City) and by PTRL West Inc. Research Lab (Hercules, CA). Water samples were extracted in organic solvent, followed by aqueous/organic extraction. The samples were comparably prepared at both labs with reagents of similar purities. Samples (between 3 and 4.25 mL) were analyzed by liquid chromatography/mass spectrometry (LC-MS), and compared to a standard reference. Both analytical laboratories were aware of the collection localities and shared their findings.

Question 3: In the modified chemical analysis excerpt above, please identify and describe all potential factors in the methodology and analysis that you feel could contribute to variation in the reported atrazine levels between and/or within sites. (Please explain why you identified these factors.)



Box 2


[Image: scoring criteria for the Box 1 task – Criterion One addresses the plausibility and relevance of the identified errors to the proposed study design; Criterion Two addresses the rationale for their potential implications for the study.]


Box 3. Representative answer from a student (student A, new-URE participant, junior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples

I would use tablets that changed the color of a solution depending on the concentration of the chemical in a sample. I would use this because I could compare the concentrations of atrazine in many different samples. (Pre-response, Level 1 – Basic)

A GC/MS would be best used to identify and measure the atrazine in the water. The MS would verify its identity, and the GC could help you determine in what concentrations the chemical is present. (Post-response, Level 3 – Adequate)


Conclusions and implications

Similar to other educational opportunities, the development and implementation of effective UR is dependent on accurately understanding how these experiences impact student learning. In the exploratory study described here, performance tasks and evaluative rubrics were designed, tested, and validated with the purpose of directly assessing changes in chemistry students' research skills over the duration of a URE. Using an instrument developed through an iterative process informed by expert faculty and assessment literature, performance data were collected from open-ended, complex, and generalizable questions situated in real-world scientific problems. Results demonstrated that the PURE instrument had high reliability and face validity in measuring chemistry students' EPS and QL skills, and can begin to provide direct quantitative and qualitative evidence of the contributions of URE participation to improving student competencies. One of the key contributions of this work, given the lack of URE studies using original direct measures, was to document differences in participants' answer quality over time, which provides support for the viability of generalizable performance instruments in assessing student research skills.

Faculty who may want to create their own assessments can use the design process detailed here to inform their instruments. The process (summarized in Table 1) outlines five steps that can guide faculty development and testing of performance-based probes. These general (and modifiable) steps include: consultation of expert faculty in chemistry and education throughout the design process to establish validity; identification of target skill areas of interest for assessment; development of effective performance tasks that are carefully mapped or aligned with scoring criteria that permit the assessment of changes in skills over time; and the collection and analysis of qualitative and quantitative evidence to evaluate validity and reliability in testing the desired target skills. Through the use of performance data to identify skill difficulties, interventions can be reformulated or refined to enhance student learning.

Despite the apparent merit of PBAs in stimulating student thought and providing direct evidence to what they can (or can't) “do”, there are three noteworthy limitations of the PURE instrument described here. First, the small sample size (n = 24), while generally considered appropriate for a pilot study, limits rigorous psychometric and quantitative analyses that are desirable in the iterative process of assessment building as well as inference making about students' performance. Future research will be conducted with a larger sample size to inform revisions of the pilot PURE instrument into a more psychometrically sound measure of EPS and QL skills in advance of broader use in UR assessment. On balance, this study provides initial evidence to the use and viability of generalizable performance instruments in assessing student research skills. Second, the design, testing, and scoring of the PURE – and other performance instruments – is time intensive, which may dissuade faculty from using this type of assessment. However, it can be contended that the tradeoff in providing direct evidence to student performance in complex skill domains, such as problem solving, may outweigh the initial “costs” in time for PBA development and practice implementing scoring rubrics. Further, existing vetted instruments can be adopted (or adapted) to meet the needs of faculty in assessing student competencies, helping to decrease the amount of upfront time expended in the design process (Towns, 2010). Third, while substantial efforts were taken to design a generalizable instrument, it is reasonable to expect that there may be some unevenness in students' topical (i.e. Atrazine and Malaria case studies) or technical familiarity – which may favor students with prior exposure. Here, however, comparisons were not drawn between students as research skills were measured on an individual basis to gauge the contributions of URE participation to learning. In other words, students acted as their own “control” group as comparisons were drawn between one's baseline and post-URE skill performance.

Moving forward, research studies on student engagement in research experiences would be strengthened by complementing self-report data on one's competencies with direct documentation of skill progress. Performance evidence to student progress should prove useful to faculty refinement of UR by identifying strengths and gaps in student proficiencies. As this exploratory study demonstrates the usefulness of performance data in providing reliable insight to the effect of UR on chemistry students' skill development, we encourage others to consider developing this type of assessment to meaningfully evaluate the effectiveness of teaching activities and educational programs in support of student learning.

Acknowledgements

Funding for the Undergraduate Scientists: Measuring Outcomes of Research Experiences project was in part provided by the NSF under Award #1140445. Any findings, opinions, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. I would like to gratefully acknowledge John Esteb and Adam Maltese for their assistance as research advisors and in reviewing this manuscript. As well, I thank Russell Balliet, Ryan Bouldin, Nathaniel Brown, Sarah Pettito, Stacy O'Reilly, Mikaela Schmitt-Harsh, Sara Skrabalak, Mary Walczak, Robert Sherwood, Heather Reynolds, Joshua Danish, and Jennifer Warner for their contributions to this study.

References

  1. American Association for the Advancement of Science, (2011), Vision and change in undergraduate biology education: a call to action, Washington, DC: American Association for the Advancement of Science.
  2. American Chemical Society Committee on Professional Training (ACS), (2008), Development of Student Skills in a Chemistry Curriculum. (accessed November 2012 from: http://portal.acs.org/portal/PublicWebSite/about/governance/committees/training/acsapproved/degreeprogram/CNBP_025490).
  3. American Educational Research Association, American Psychological Association, National Council on Measurement in Education [AERA/APA/NCME], (2014), Standards for educational and psychological testing, Washington, DC: American Psychological Association.
  4. Boyer Commission on Educating Undergraduates in the Research University (Boyer Commission), (2008), Reinventing undergraduate education: Three years after the Boyer report, State University of New York-Stony Brook: Stony Brook, NY.
  5. Bransford J. D. and Schwartz D. L., (1999), Rethinking transfer: a simple proposal with multiple implications, Rev. Res. Educ., 24, 61–100.
  6. Brown N. J. and Wilson M. A., (2011), Model of cognition: the missing cornerstone of assessment, Educational Psychology Review, 23(2), 221–234.
  7. Brown N. J., Nagashima S. O., Fu A., Timms M. and Wilson M., (2010), A framework for analyzing scientific reasoning in assessments, Educational Assessment, 15(3–4), 142–174.
  8. Cooper M. M., (2007), Data-driven education research, Science, 317, 1171.
  9. Corwin L. A., Graham M. J. and Dolan E. L., (2015), Modeling course-based undergraduate research experiences: an agenda for future research and evaluation, CBE-Life Sci. Educ., 14(1), es1.
  10. Crowe M. and Brakke D., (2008), Assessing the impact of undergraduate-research experiences on students: an overview of current literature, CUR-Quarterly, 28(1), 43–50.
  11. Dalkey N. C., (1972), The Delphi method: an experimental study of group opinion, in Dalkey N. C., Rourke D. L., Lewis R. and Snyder D. (ed.) Studies in the quality of life: Delphi and decision making, Lexington, MA: Lexington Books.
  12. Feldon D. F., Maher M. A. and Timmerman B. E., (2010), Performance-based data in the study of STEM PhD education, Science, 329, 282–283.
  13. Gormally C., Brickman P. and Lutz M., (2012), Developing a test of scientific literacy skills (TOSLS): measuring undergraduates' evaluation of scientific information and arguments, CBE-Life Sci. Educ., 11(4), 364–377.
  14. Hair J., Anderson R., Tatham R. and Black W., (2006), Multivariate Data Analysis, NJ: Pearson/Prentice Hall, Inc.
  15. Harsh J. A., Maltese A. V. and Warner J. M., (2013), The development of expertise in data analysis skills: an exploration of the cognitive and metacognitive processes by which scientists and students construct graphs, Boston, MA: American Association for the Advancement of Science (AAAS) National Meeting.
  16. Hayes T. B., Collins A., Lee M., Mendoza M., Noriega N., Stuart A. A. and Vonk A., (2002), Hermaphroditic, demasculinized frogs after exposure to the herbicide atrazine at low ecologically relevant doses, Proc. Natl. Acad. Sci. U. S. A., 99(8), 5476–5480.
  17. Holme T., Bretz S. L., Cooper M., Lewis J., Paek P., Pienta N. and Towns M., (2010), Enhancing the role of assessment in curriculum reform in chemistry, Chem. Educ. Res. Pract., 11(2), 92–97.
  18. Hsu C. C. and Sanford B. A., (2007), The Delphi technique: making sense of consensus, Practical Assessment, Research, and Evaluation, 12(10) (accessed November 2012 from http://pareonline.net/getvn.asp?v=12&n=10).
  19. Ibrahim H., Imam I., Bello A., Umar U., Muhammed S. and Abdullahi S., (2012), The potential of Nigerian medicinal plants as antimalarial agents: a review, Int. J. Sci. Technol., 2(8), 600–606.
  20. Laursen S., Seymour E., Hunter A. B., Thiry H. and Melton G., (2010), Undergraduate research in the sciences: Engaging students in real science, San Francisco: Jossey-Bass.
  21. Linn R. L., Baker E. L. and Dunbar S. B., (1991), Complex, performance based assessment: expectations and validation criteria, Educ. Res., 20(8), 15–21.
  22. Linn M. C., Palmer E., Baranger A., Gerard E. and Stone E., (2015), Undergraduate research experiences: impacts and opportunities, Science, 347(6222), 1261757.
  23. Lopatto D., (2010), Science in solution: the impact of undergraduate research on student learning, Washington, DC: CUR and Research Corporation for Scientific Advancement.
  24. Mehrens W. A., (1992), Using performance assessment for accountability purposes, Educational Measurement: Issues and Practice, 11(1), 3–9, 20.
  25. Monsell S., (2003), Task switching, Trends Cognit. Sci., 7(3), 134–140.
  26. National Council on Education and the Disciplines (NCED), (2001), Mathematics and Democracy: The Case for Quantitative Literacy, Washington, DC: The Woodrow Wilson National Fellowship Foundation.
  27. President's Council of Advisors on Science and Technology (PCAST), (2012), Engage to Excel: Producing One Million Additional College Graduates with Degrees in Science, Technology, Engineering, and Mathematics; Report to the President. Executive Office of the President: Washington, DC.
  28. Quellmalz E. S., (1991), Developing criteria for performance assessments: the missing link, Appl. Meas. Educ., 4(4), 319–331.
  29. Raker J. R. and Towns M. H., (2012), Designing undergraduate-level organic chemistry instructional problems: seven ideas from a problem-solving study of practicing synthetic organic chemists, Chem. Educ. Res. Pract., 13(3), 277–285.
  30. Rethans J. J., Norcini J. J., Baron-Maldonado M., Blackmore D., Jolly B. C., LaDuca T., Lew S., Page G. G. and Southgate L. H., (2002), The relationship between competence and performance: implications for assessing practice performance, Med. Educ., 36(10), 901–909.
  31. Shadle S. E., Brown E. C., Towns M. H. and Warner D. L., (2012), A rubric for assessing students' experimental problem-solving ability, J. Chem. Educ., 89, 319–325.
  32. Shavelson R., Baxter G. and Pine J., (1991), Performance assessment in science, Appl. Meas. Educ., 4(4), 347–362.
  33. Stein B., Haynes A. and Redding M., (2007), Project CAT: assessing critical thinking skills, in Deeds D. and Callen B. (ed.) Proceedings of the 2006 National STEM Assessment Conference, Springfield, MO: Drury University.
  34. Timmerman B., Strickland D., Johnson R. and Payne J., (2010), Development of a universal rubric for assessing undergraduates' scientific reasoning skills using scientific writing. [Online]. University of South Carolina Scholar Commons http://scholarcommons.sc.edu/ (accessed Aug 22, 2013).
  35. Towns M. H., (2010), Developing learning objectives and assessment plans at a variety of institutions: examples and case studies, J. Chem. Educ., 87(1), 91–96.
  36. Zoller U., (2001), Alternative assessment as (critical) means of facilitating HOCS-promoting teaching and learning in chemistry education, Chem. Educ. Res. Pract., 2(1), 9–17.

Footnotes

See Towns (2010) for a review of cognitive, affective, and psychomotor measures used in chemistry.
Student survey feedback later indicated that 71% and 65% of students found the situating frog hermaphroditism and medicinal plant topics, respectively, to be of interest.
§ It is worth mentioning that the intended population(s) for a measure has to be taken into account when designing or using an instrument, as the properties of validity and reliability are established for the population it is tested on (AERA/APA/NCME, 2014).
See Harsh, Esteb, and Maltese (in preparation) for a more extended description of URE participant learning progressions as measured using the PURE.

This journal is © The Royal Society of Chemistry 2016