Evaluating the development of chemistry undergraduate researchers’ scientific thinking skills using performance data: first findings from the Performance assessment of Undergraduate Research Experiences (PURE) instrument

Joseph Harsh *a, John J. Esteb b and Adam V. Maltese c
aJames Madison University, Harrisonburg, Virginia 22807-0001, USA. E-mail: harshja@jmu.edu
bButler University – Chemistry, Indianapolis, Indiana, USA
cIndiana University Bloomington, Bloomington, Indiana, USA

Received 1st November 2016 , Accepted 23rd March 2017

First published on 28th March 2017


Abstract

National calls in science, technology, engineering, and mathematics (STEM) education reform efforts have advanced the wide-scale engagement of students in undergraduate research for the preparation of a workforce and citizenry able to attend to the challenges of the 21st century. Awareness of the potential benefits and costs of these experiences has led to an emerging literature base outlining gains in participants’ cognitive, affective, and conative domains to support the impact of undergraduate research for students of all backgrounds; however, the majority of this work has relied on self-report data, limiting inferences about the causal effects of these experiences on student learning. As part of a larger project on apprentice-like undergraduate research experiences (UREs) in the physical sciences, the present exploratory study complemented indirect self-report data with direct performance data to assess the development of chemistry students’ scientific thinking skills over a research experience. Performance data were collected using the Performance assessment of Undergraduate Research Experiences (PURE) instrument, a validated tool designed to assess changes in chemistry students’ analytical and data-driven decision-making skills through open-response tasks situated in real-world problems from the primary literature. Twenty-four summer research students in chemistry (46% women; 50% 1st/2nd year students; 42% first-time URE participants) from seven colleges and universities provided baseline and post-intervention performance data. Differences in pre/post-response task correctness provided a direct measure of individual changes in student competencies. Early study findings indicate the positive contributions of UREs to students’ competencies in the areas of problem-solving, experimental design and the use of research techniques, data analysis and the interpretation of results, and the evaluation of primary literature. Survey data were also collected on students’ self-skill ratings to allow comparisons between perceived and demonstrated competencies, which were found to be weakly correlated. This work begins to offer direct evidence of the effect of UREs on student learning progressions as well as the potential use of performance test data in evaluating the success of research training interventions designed to improve scientific thinking skills.


1. Introduction

Undergraduate research (UR) is an increasingly popular, and called for, educational opportunity to engage students in authentic field-based practices for the development of cognitive skills and abilities to do science in preparation for science, technology, engineering, and mathematics (STEM) careers (Boyer Commission on Educating Undergraduates in the Research University, 2008; American Association for the Advancement of Science [AAAS], 2011; President's Council of Advisors on Science and Technology [PCAST], 2012; National Academies of Science, Engineering, and Medicine [NAS/NAE/NAM], 2017). Recent research and evaluation studies have outlined numerous conferred (or as suggested by Weston and Laursen (2015), “promised”) benefits for UR participants in enhancing research-related skills, understanding of the research process, science affect (e.g., self-confidence, self-efficacy), personal skills (e.g., time management), and for graduating and retaining students in STEM degree programs and careers (as reviewed by Sadler and McKinney, 2008; Laursen et al., 2010; Lopatto, 2010; Linn et al., 2015). However, while much research has investigated the efficacy and impact of undergraduate research experiences (UREs) and course-based undergraduate research experiences (CUREs) on participants’ cognitive, affective, and conative domains, there is little direct evidence of the causal effects of these experiences on student learning (Crowe and Brakke, 2008; Linn et al., 2015; NAS/NAE/NAM, 2017). In a sampling of 60 UR-focused articles,§ most relied on indirect self-report (surveys and interviews) and institutional (e.g., GPA, degree completion) data from active or recent participants, which can lead to difficulties in appropriately estimating the effect of these experiences on the development of students’ cognitive skills (Gonyea, 2005; Cox and Androit, 2009; Bowman, 2010). As identified by Seymour et al. (2004), “the data-gathering problems [of UR evaluations] leave a black box between program goals and activities on one hand, and the outcomes claimed on the other” (p. 497).

The need to rigorously assess changes in UR participants’ disciplinary research skills, which are critical outcomes in the preparation for STEM careers (National Research Council [NRC], 2005), has led to recent calls for generalizable performance-based assessments (PBAs) that provide direct evidence of student learning (Feldon et al., 2010; AAAS, 2011; Linn et al., 2015; NAS/NAE/NAM, 2017). Through authentic tasks that are valued in their own right, PBA requires the application of knowledge in the construction of an original response (Linn et al., 1991; Mehrens, 1992; Zoller, 2001). Given the “real-world” nature of the assessment, in which the individual directly demonstrates their knowledge in context, performance measures are often assumed to have higher validity than indirect measures (Linn et al., 1991; Mehrens, 1992; Miller and Linn, 2000; Feldon et al., 2010). Common examples of PBAs include scientific writing, oral presentations, oral and written tests of skills, and direct observation. However, while PBA is a long-standing (Linn et al., 1991) and widely endorsed assessment practice in K-16 science (Slater and Ryan, 1993; AAAS, 2011), the use of performance data to evaluate the development of postsecondary students’ research training has seen limited adoption (Feldon et al., 2010). This relative absence in undergraduate science may reflect challenges such as the lack of research competency standards in higher education, constraints on instructor time and resources, the inherent complexity of capturing skill development over time, the fact that assessment of these skills is largely ignored, and the inherent difficulties involved in creating these measures (Rueckert, 2008; Feldon et al., 2010; Timmerman et al., 2010). In response, as a positive step forward, a growing base of vetted performance instruments in the literature (e.g., Gormally et al., 2012) and from evaluation groups (e.g., the Critical Assessment of Thinking [CAT] group at Tennessee Technological University; James Madison University Center for Assessment and Research Studies [CARS]) is becoming more available to faculty interested in assessing student learning in their classes.

Though the importance of UREs to student training in the sciences is without question, “it should not simply be assumed that a hands-on scientific task encourages the development of problem solving skills, reasoning ability, or more sophisticated mental models of the scientific phenomenon” (Linn et al., 1991, p. 19). Surprisingly, aside from discussions of potential means to assess UREs using performance data (Willison, 2009; Dasgupta et al., 2014), there is a paucity of direct evidence in the literature on the trajectories of research-related skill development for UR participants in the sciences (Crowe and Brakke, 2008; Linn et al., 2015). While validated instruments are available to measure students’ critical thinking or experimental problem-solving abilities, it can be argued that the majority of these tools are limited in evaluating the success of UREs on student learning due to a general focus on non-science-specific competencies (e.g., Stein et al., 2007), topic-specific problems that may limit generalizability (e.g., Shadle et al., 2012), and closed-response designs (e.g., Gormally et al., 2012) that may not appropriately represent task authenticity or capture complex cognitive skills (Stein et al., 2007).

In this article, we describe the early findings of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, a performance test designed to directly measure changes in chemistry students’ scientific thinking skills (STS) over a summer URE. Designed to complement self-report data as part of a larger national project on UREs, the PURE instrument is a validated measure consisting of 16 multipart open-ended tasks contextualized around real-world chemistry problems and associated criterion-based rubrics focused on experimental problem solving (EPS) and quantitative literacy (QL) skills. Specifically, this pilot study was guided by the research questions: (a) How do students’ EPS and QL skills change during a URE? and (b) What is the level of agreement between URE students’ skill ratings and demonstrated performance? It was anticipated that skill changes for students would be discernible at the end of their respective URE, and that there would be moderate overlap in students’ self-report and performance data.

While the need for tools to directly measure URE student gains has been widely noted in the literature (Linn et al., 2015), to the best of our knowledge, this is the first study dedicated to assessing the development of participants’ disciplinary (chemistry) research skills. We feel this work advances our understanding of URE gains and may guide faculty, administrators, and other STEM researchers interested in investigating the efficacy and impact of UR. Similar to other educational innovations, accurately measuring students’ learning progressions over a research experience is essential for making decisions about programmatic refinement in support of learning outcomes (Kardash, 2000; Delatte, 2004).

2. The PURE instrument

2.1. PURE description

This exploratory study was designed to complement the ongoing national project, Undergraduate Scientists: Measuring Outcomes of Research Experiences (US-MORE; NSF#1140445), that uses a mixed methods approach to better understand the nature of these experiences and how they benefit students in the physical sciences. To provide an encompassing view of UREs, between summer 2012 and summer 2016, the project has collected various quantitative (surveys) and qualitative (interviews, point-of-view camera recordings, weekly journals) data from students (n = 498) and faculty/lab mentors (n = 90) at 38 institutions. Based on the information collected in the larger US-MORE study and our prior work in the field (Harsh et al., 2011, 2012) pertaining to short- and long-term URE participant skill gains, we sought to complement our indirect data collection with a performance instrument to more directly assess (and validate) student learning progressions.

The PURE instrument was designed to characterize the effect of UR on 11 scientific thinking skills (STSs) in the areas of experimental problem solving and quantitative literacy (see Table 1). As defined by the American Chemical Society (ACS, 2008), experimental problem solving (EPS) includes the ability to “define a problem clearly, develop testable hypotheses, design and execute experiments, analyze data, and draw appropriate conclusions” (p. 1). Quantitative literacy (QL), as defined by the National Council on Education and the Disciplines (NCED, 2001), is a habit of mind: the knowledge and skills necessary to effectively engage in daily and scientific quantitative situations (e.g., reading visual data, understanding basic statistics presented in a study). These skills were selected as they (a) are highly valued by science faculty and reform organizations for scientific literacy (AAAS, 2011; Gormally et al., 2012), (b) reflect authentic practices used by practicing scientists (e.g., Kirschner, 1992; ACS, 2008), and (c) are gains commonly reported in the URE literature (e.g., Laursen et al., 2010) that were specifically focused on in the survey data of the larger project.

Table 1 Target skills measured by the PURE instrument and US-MORE survey
PURE target skills Explanation of skill Associated survey item
Understand methods of inquiry that lead to scientific knowledge
Understand how to search for information Identify databases to search for information in the field relevant to a problem “Conducting searches for literature related to your project”
Understand research design Identify a research design to experimentally address a scientific problem “Developing your own research plan”
Understand research techniques/instrumentation Identify an instrument-based strategy to address a chemical problem “Using advanced research techniques in your field”
Understand how research design may influence scientific findings Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) “Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls)”
Troubleshoot technical issues Evaluate a scientific problem to identify possible technical causes “Troubleshooting theoretical/technical errors in research during data collection”
Interpret, represent, and analyze quantitative scientific data
Represent data in a visual form Convert relevant information into an appropriate visual representation given the type of data “Representing data in a visual form common for the research field”
Interpret visual representations of data Interpret or explain information presented in visual forms “Interpreting visual representations of data”
Understand basic statistics Understand the need of basic statistics to quantify uncertainty in data or draw conclusions “Interpreting statistical analysis of research”
Evaluating scientific information
Evaluate evidence and critique experimental designs Understand the limits of correlational data and experimental design elements “Interpreting and critiquing results and findings presented in the literature”
Identify additional information needed to evaluate a hypothesis/interpretation Explain how new information may contribute to the evaluation of a problem “Identifying further information necessary to support research-related results in the literature”
Provide alternative explanations for results that may have many causes Recognize and explain possible alternative interpretations for given data or observations N/A


2.2. PURE design and implementation

The PURE instrument was designed and tested through an iterative process informed by assessment literature (e.g., Linn et al., 1991; Raker and Towns, 2012), existing performance instruments (e.g., Stein et al., 2007; Gormally et al., 2012; Shadle et al., 2012), field-testing, and experts in chemistry and science education. The final instrument consisted of 16 multipart questions (30 total evaluative tasks with associated criterion-based rubrics) and was designed to take students 60 minutes to complete in an effort to limit the potential effects of cognitive fatigue on test performance (Ackerman and Kanfer, 2006). For the purpose of test authenticity and meaningfulness, the tasks were situated in two real-world problems (i.e. the potential effect of atrazine, a common pesticide, on the sexual development of amphibians and the antimalarial properties of compounds found in medicinal plants). Questions were largely designed in an open-response, two-tier fashion that prompted students to (1) provide a “best” response to solve a problem, and (2) explain their thought patterns for the response. See Box 1 for a representative task and scoring criteria.

Box 1 Representative question and scoring criteria from the case study focusing on the effects of the pesticide atrazine on the sexual development of male frogs.


In an effort to make the PURE tasks as generalizable as possible, the instrument was developed based on: (a) feedback from URE faculty mentors (n = 18) about the research practices and skills their students are commonly engaged in or expected to develop, (b) prior research on URE outcomes, (c) test items that permitted a range of responses, and (d) the use of multiple tasks to measure each target skill to improve test generalizability (Linn et al., 1991). Information was also obtained using a short set of supplemental questions on student views of the PURE tasks (e.g., topical familiarity, recommendations for future test drafts) and relevant experiences that may have influenced their performance (e.g., technical practices).

2.3. PURE administration

Purposeful sampling was conducted in an effort to gain insight into the central phenomenon of interest through information-rich cases from which the most can be learned (Merriam, 2009). In Summer 2013, all chemistry students (n = 54) at seven institutions from the US-MORE project were invited to participate in this study. Based on early US-MORE data collected at five of the sampled institutions, it was anticipated that the invited participants would provide an adequate cross-section (e.g., gender, academic stage, prior URE participation) for study. In addition to diversity in educational and demographic backgrounds, it was also recognized that the students, even those working on the same project, would likely have differing research experiences (e.g., activities, mentorship). While this is a challenge for evaluative purposes due to a lack of general standardization, we felt inviting all chemistry student researchers would maximize the potential to obtain information from as diverse a set of participants as possible – which would best reflect the natural variation we see in UR.

The PURE test was administered to students at the start and end of their research experiences using Qualtrics survey software to provide baseline and post-intervention data for assessing changes in response quality. Students were instructed to answer the questions independently within a one-week period without the use of outside resources, and, if possible, to complete all tasks in one sitting. For the purpose of submitting the hand-drawn graphical data representations, students were asked to send the images electronically (i.e. high quality photos from a camera or smart phone, computer scanner) or via mail. IRB approval was obtained in advance of the study from Indiana University.

2.4. PURE scoring

PURE test data were stripped of identifiers and scored by multiple trained raters with graduate degrees in science. Overall, interrater reliability, as estimated by the ICC (95% CI), indicated high average reliability in scoring across questions at 0.92. In other words, these results indicate that 92% of the variation in total scoring was attributable to actual differences in response quality rather than other inconsistencies. Differences in pre/post-response quality, as assessed by the scoring criteria, provided a direct measure of changes in participant competencies. Based on the two-step design described above, in which students were typically prompted to first provide a “best” response to a given real-world scenario where there may be multiple “correct” answers and then offer a rationale regarding why they chose to respond in such a manner, each step was independently evaluated. As an example, for the technique-based questions, student responses were criteria-rated for (a) the appropriateness of the selected technique for use in solving the provided problem and (b) the reasoning in selecting this given technique independent of the method selected (i.e. without penalty for selecting an inappropriate technique). Student performance data were analyzed using descriptive statistics and qualitative examination at the individual level and pairwise comparisons at the group level to assess longitudinal changes. Data were handled using IBM SPSS (v.20).
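For readers who wish to reproduce a comparable average-measures interrater reliability estimate programmatically, a minimal Python sketch is shown below. The scoring analyses in this study were run in SPSS; the long-format layout, column names, toy ratings, and the use of the pingouin package here are illustrative assumptions, not the study's data or code.

```python
# Minimal sketch (not the study's code): estimating average-measures interrater
# reliability (ICC) for rubric scores held in long format. The toy ratings and
# column names are hypothetical.
import pandas as pd
import pingouin as pg  # third-party package: pip install pingouin

ratings = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":       ["R1", "R2"] * 5,
    "score":       [3, 3, 1, 2, 4, 4, 2, 2, 0, 1],
})

icc = pg.intraclass_corr(data=ratings, targets="response_id",
                         raters="rater", ratings="score")

# ICC2k is the two-way random-effects, average-measures estimate; a value near
# 0.92 would indicate that most scoring variance reflects response quality.
print(icc.set_index("Type").loc["ICC2k", ["ICC", "CI95%"]])
```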

2.5. US-MORE survey

We collected data on students’ backgrounds and educational experiences as well as their research activities, career interests, and perceived ability in research skills as part of the larger project. These data were collected via an online survey (∼20 minutes) administered to participants pre- and post-experience with the purpose of evaluating URE outcomes. Survey items were based on our prior research (Harsh et al., 2011, 2012) and published UR surveys and interviews (e.g., Kardash, 2000; Hunter et al., 2004; Lopatto, 2007), which were modified to provide greater detail about participants’ cognitive and affective gains (Maltese et al., in preparation). While we collected a wide range of survey data, using open- and closed-ended questions, only data concerning students’ Likert-type self-ratings relevant to this study are discussed here, as an indirect measure of skill development and for comparison to the performance data. Table 1 includes the 10 survey items focused on here given their relevance to the EPS and QL target skills in the PURE instrument. Survey data were handled and analyzed in IBM SPSS (v.20).

3. Early findings

3.1. Participants

Thirty-four students (63% response rate) provided pre-URE data, and 24 completed the partnered post-test (74% retention rate). While data from all participants were used to characterize the instrument's psychometric properties (Harsh, 2016), the estimated effects of UREs on student skill development discussed here are based on the responses from students who completed both the pre- and posttest. This sample comprised 24 students from 7 institutions demonstrating relatively equal distributions in gender (54% men), academic stage (50% 1st/2nd year students), research experience (42% first-time URE participants), institutional type (50% conducting research at research extensive universities), and postgraduate plans (71% STEM career, 25% non-STEM career, and 4% undecided). Student backgrounds were comparable across groups that did and did not complete the posttest, as were their respective pretest scores, which suggests comparable proficiencies and that attrition was not linked to poor initial test performance. While this sample is not of a magnitude to support broad generalizations, it is reasonable for such an exploratory performance-based study (e.g., Stein et al., 2007). Given the intensity of the students’ summer URE programs (∼40 hours per week on average), as taken from student and mentor accounts as well as programmatic policies concerning outside activities, few (if any) completed coursework or held other science-related employment during their research experience that may have influenced their test performance. While the online test was designed and field tested to take 60 minutes to complete, on average, student test time was 105 minutes (S.D. 82 minutes). As informally commented on by several students, it is likely that this variation can be attributed to time spent “off-task” (e.g., social media, food preparation) despite instructions to complete the online test in a single uninterrupted sitting. Participants received a modest stipend for completing each test administration in recognition of their time.

3.2. Performance evidence to the effects of UREs on student skills

Evidence of the effect of UREs on EPS and QL skill improvement was based on the analysis of differences in students’ pre- and posttest performance. As rubric scoring for most items is associated with mapped levels of proficiency, changes in scores are assumed to represent student competencies when taken collectively with other comparable skill-based tasks (Linn et al., 1991). To provide a broad view of the potential impact of UREs on student learning, early findings at both the individual and group levels are presented here.
3.2.1. Individual student level findings. Given the potential unevenness in student backgrounds and research experiences, research skills were measured on an individual basis to gauge the contributions of URE participation to learning – in other words, students acted as their own control group as comparisons were drawn between one's baseline and post-URE performance. Differences in student responses were initially examined by the number of students demonstrating given gains (or losses) per target skill and item, which allowed for the observation of trends across participants. Table 2 shows largely positive patterns of change in students’ pre- and posttest scores by skill category.
Table 2 Distribution of students’ point-wise change in the target skill areas as a function of time
Target skill Point-wise change (±) between pre- and posttest
−2 −1 0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +10
Understand how to search for information 16 5 3
Understand research design 2 1 6 5 3 4 1 2
Use (or understand when to use) research techniques and instrumentation 1 2 6 3 4 3 3 2 2
Understand how research design may influence scientific findings 1 2 4 2 2 2 4 1 3 2 1
Identify or troubleshoot technical issues in data collection 1 4 4 4 7 2 2 1
Evaluate evidence and critique experimental design to evaluate hypotheses 1 1 5 6 7 2 1 1
Identify additional information needed to evaluate a hypothesis/interpretation 3 3 2 4 3 2 4 2 1
Provide alternative explanations for results that may have many causes 1 7 3 4 3 2 3 1
Represent data in a visual form 2 5 3 1 7 3 1 1 1
Read and interpret visual representations of data 2 2 5 7 2 4 2 1
Understand basic statistics 2 13 4 2 1
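The point-wise change distributions summarized in Table 2 amount to differencing each student's pretest and posttest rubric totals per target skill and counting students at each change value. The short Python sketch below illustrates one way to build such a tabulation; the file name, column names, and long-format layout are hypothetical assumptions, not part of the original study.

```python
# Sketch (hypothetical data layout): counting students at each point-wise
# pre/post change per target skill, in the manner of Table 2.
import pandas as pd

# Expected columns: student, skill, administration ("pre" or "post"), score.
scores = pd.read_csv("pure_scores_long.csv")

wide = (scores.pivot_table(index=["student", "skill"],
                           columns="administration", values="score")
              .reset_index())
wide["change"] = wide["post"] - wide["pre"]

# Rows: target skills; columns: point-wise change values; cells: student counts.
distribution = pd.crosstab(wide["skill"], wide["change"])
print(distribution)
```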


As is to be expected given the likely idiosyncrasies in students’ research experiences, the extent of point-wise gains per target skill and item, if any, was found to be highly variable. As a specific example,|| for the task identify and explain an instrument-based strategy to address a chemical problem for the ATZ case study, which was used in measuring the broader skill area of using research techniques and instrumentation (Table 2), 7 students (29%) scored one point higher on the posttest, 3 (13%) two points higher, 3 (13%) three points higher, 1 (4%) four points higher, 1 (4%) five points higher, 2 (8%) demonstrated no change, 1 (4%) scored one point lower, and 4 (17%) scored three points lower (see Box 1 for the test item and associated scoring rubric). To qualitatively demonstrate longitudinal changes in students’ pre- and posttest data, representative answers for this example task (with scored levels of proficiency) are presented in Boxes 2 and 3.



Box 2 Representative answer from a student (Student A, new-URE participant, senior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a one point (level) gain.



Box 3 Representative answer from a student (Student B, new-URE participant, junior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a multipoint (level) gain.


In this particular task, students needed to clearly demonstrate an understanding of the use of an instrumentation-based strategy to solve a real-world chemistry problem by addressing the relevant features of the technique. It is important to point out that student responses were evaluated as to why they identified a given strategy independent of the “correctness” of the selected technical approach. Student A provided a set of answers that demonstrated a small level of change by presenting a more focused justification of her strategy on the posttest; however, no detail is provided as to how the approach would be used to solve the problem in terms of compound identification. In comparison, Student B's set of answers demonstrated a higher level of change, as technical errors on the use of instrumentation (i.e. mass spectrometry alone is not an appropriate tool to determine the relative components of a mixture) are included in the pretest and are corrected with a more complete justification or explanation for the strategy in the posttest response. Such differences may reflect a number of student factors (e.g., research activities, prior knowledge) and suggest the idiosyncratic nature of skill development in these experiences. While it can be argued that the higher level of sophistication between Student B's pre- and posttest reasoning and technological knowledge may be attributed to his potential use of this strategy over the URE, the ability to appropriately transfer new information to a new situation does support that learning occurred for the participant (Bransford and Schwartz, 1999).

Interestingly, as seen in Table 2, a small proportion of students (between 4% and 21%) on several items scored lower on the posttest than the pretest. Most commonly, in the case of these “losses”, students scored one point lower on the posttest than the pretest. As it is not likely that students regressed in their capabilities over a short 8 to 10 week URE period, the results suggest that these differences may often reflect random effects such as being less attentive in task completion, which is represented in Box 4.

In this instance, Student C simply provided less detail in his justification for how the identified strategy would help solve the problem (i.e. compound detection) on the posttest than on the pretest. However, in comparison, it was noted that a subset of students attempted to “over-transfer” newly acquired technical knowledge to novel situations where it may not be appropriate. See Box 5 for a representative example, in which Student D shifted his pretest response from the “best” analytical technique (gas chromatography-mass spectrometry [GCMS]) for testing the presence and concentration of different compounds in a mixture (i.e. atrazine in lake water) to a less appropriate technique for the conclusive identification of a compound (ultraviolet-visible spectroscopy [UV-Vis]), which the student used in his work.



Box 4 Representative answer from a student (Student C, new-URE participant, senior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a one point (level) “loss”.



Box 5 Representative answer from a student (Student D, new-URE participant, senior) in the identification of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a shift from the “best” instrument to a less appropriate one.


These occasional instances of students trying to leverage new knowledge inappropriately (∼9% of total responses across technique-focused items) suggest that they may not fully understand the newly learned concepts and procedures. While infrequent in the data, direct evidence of such gaps in student proficiencies is of particular value for providing insight into how learning can be improved.

3.2.2. Group level findings.
3.2.2.1. Total PURE test scores. As the assumption of normality in students’ pooled pre- and posttest data was met through a Shapiro–Wilk test (p > 0.05, skewness = −0.395, kurtosis = 0.0194), paired sample t tests were used to examine group differences over time for the targeted skills. Given the small sample size (n = 24), the Wilcoxon signed-rank test, a non-parametric test for comparing two related samples, was also calculated for triangulation with the paired t test estimates.** To correct for the possibility of increased rates of false positives from multiple comparisons, the false discovery rate (FDR) procedure was applied across the 11 targeted skills. In addition, effect sizes, which quantify the standardized difference between means, were estimated (Cohen, 1992). Comparisons were first made between total pre- and posttest scores, then by the large target skill test categories identified in Table 1.
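To make this analysis pipeline concrete, the sketch below runs a normality check on the paired differences, a paired t test with a Wilcoxon signed-rank check, a paired-samples Cohen's d, and a Benjamini–Hochberg FDR adjustment across the 11 target skills. It uses scipy and statsmodels on simulated scores purely for illustration; the study's analyses were conducted in SPSS and may differ in implementation detail.

```python
# Illustrative sketch of the group-level tests (simulated data, not study data):
# Shapiro-Wilk normality check, paired t test, Wilcoxon signed-rank test,
# paired-samples Cohen's d, and Benjamini-Hochberg FDR adjustment.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2013)
n_students = 24
# Simulated pre/post rubric totals for each of the 11 target skills.
skills = {f"skill_{i + 1}": (rng.normal(5, 2, n_students),
                             rng.normal(7, 2, n_students)) for i in range(11)}

rows = []
for name, (pre, post) in skills.items():
    diff = post - pre
    _, p_norm = stats.shapiro(diff)            # normality of paired differences
    t, p = stats.ttest_rel(post, pre)          # paired t test
    _, p_wilcoxon = stats.wilcoxon(post, pre)  # non-parametric triangulation
    d = diff.mean() / diff.std(ddof=1)         # one common paired-samples d
    rows.append((name, t, p, p_wilcoxon, d))

# Control the false discovery rate across the 11 skill-level t tests.
reject, p_adj, _, _ = multipletests([r[2] for r in rows],
                                    alpha=0.05, method="fdr_bh")
for (name, t, p, p_w, d), p_a, rej in zip(rows, p_adj, reject):
    print(f"{name}: t = {t:.2f}, p = {p:.4f}, FDR p = {p_a:.4f}, "
          f"d = {d:.2f}, significant = {rej}")
```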

Group-level pairwise comparisons between test administrations revealed posttest scores (M = 92 [maximum of 146 points], SD = 12) were significantly higher than those on the pretest (M = 72, SD = 9), t(24) = 11.23, Z = 3.72, p < 0.001 (Fig. 1). The estimated effect size was d = 2.65, indicating that the treatment effect was of large magnitude (d > 0.8) for contributing to the development of participants’ STSs (Cohen, 1992).


Fig. 1 Comparison of students’ (n = 24) total score on the PURE Instrument with data conceptualization questions by test administration. Error bars represent 95% CIs around the means. Maximum 146 points possible. Figure replicated from Harsh (2016) with permission.

Analysis of the mean differences in participants’ pre/posttest scores is first described by large target skill category (i.e. understanding methods of scientific inquiry, analyzing scientific data, and evaluating scientific information), and then by the nested EPS and QL target skills (Tables 3–5). Significance tests were adjusted for multiple comparisons using Benjamini and Hochberg's False Discovery Rate procedure (Benjamini and Hochberg, 1995), at the 0.05 level.

Table 3 Mean pre- and posttest scores for the large skill category Understanding Methods of Scientific Inquiry with calculated t-test, p value, adjusted p value, and effect size
Target skill category Points possible Pretest meanc (SD) Posttest meanc (SD) Prob. of difference (t-test) p valuea FDR adjusted p valueb Effect size (Cohen's d)
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 23 students.
Understand how to search for information in the field 3 1.75 (1.17) 1.98 (1.05) 1.06 0.302 0.050 0.21
Understand research design 11 5.25 (2.48) 7.52 (1.86) 4.90 0.0001 0.005 0.80
Understand research techniques 15 4.43 (2.94) 6.7 (2.83) 3.47 0.0001 0.005 0.68
Understand how research design may influence scientific findings 15 6.46 (2.74) 9.45 (2.67) 3.94 0.001 0.027 0.78
Troubleshoot technical issues in data collection 16 2.83 (2.04) 4.94 (2.48) 5.69 0.0001 0.005 1.17
Total for large skill category: understanding methods of scientific inquiry 60 19.86 (5.70) 30.05 (5.97) 8.65 0.0001 1.66


Table 4 Mean pre- and posttest scores for the large skill category Evaluating Scientific Information with calculated t-test, p value, adjusted p value, and effect size
Target skill category Points possible Pretest meanc (SD) Posttest meanc (SD) Prob. of difference (t-test) p valuea FDR adjusted p valueb Effect size (Cohen d)
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 24 students.
Evaluate evidence and critique experimental designs 11 6.79 (1.93) 7.98 (1.83) 3.00 0.006 0.036 0.61
Identify additional information needed to evaluate a hypothesis/interpretation 12 4.56 (1.85) 7.17 (3.11) 4.61 0.001 0.027 0.94
Provide alternative explanations for results or relationships that may have many causes 6 1.70 (1.52) 3.37 (1.56) 3.41 0.0001 0.005 0.69
Total large skill category: evaluating scientific information 29 13.1 (2.53) 18.56 (4.13) 7.10 0.0001 1.49


Table 5 Mean pre- and posttest scores for the large skill category Interpreting, Representing, and Analyzing Quantitative Scientific Data with calculated t-test, p value, adjusted p value, and effect size
Target skill category Points possible Pretest meanc (SD) Posttest meanc (SD) Prob. of difference (t-test) p valuea FDR adjusted p valueb Effect size (Cohen's d)
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 21.
Represent data in a visual form 40 27.24 (3.28) 30.69 (3.19) 5.61 0.0001 0.005 0.59
Read and interpret visual representations of data 12 6.13 (1.56) 7.02 (2.37) 3.17 0.036 0.045 0.65
Understand basic statistics 5 3.25 (1.59) 3.71 (1.76) 2.31 0.031 0.041 0.47
Total for large skill category: interpret, represent, and analyze quantitative data 57 36.65 (2.5) 41.23 (4.16) 5.72 0.000 1.29



3.2.2.2. Understanding methods of scientific inquiry. Table 3 examines students’ pre- and posttest scores in the large skill category of understanding methods of scientific inquiry, which was measured by 11 test questions in five target skill areas (with a total of 60 maximum points possible [MPP]). Pairwise comparisons, made without an identified outlier at the higher end of the pretest scoring scale (z > 3) to provide a more conservative estimate, indicated posttest scores were significantly higher than pretest scores in the area of understanding methods of scientific inquiry (p < 0.05) with a large effect size (d = 1.66).

For the five scientific inquiry-related target skill areas (Table 3), post hoc pairwise comparisons revealed that the pretest scores for the target skills of understand research design, understand research techniques to solve a problem, understand how research design may influence scientific findings, and troubleshoot issues in data collection, were significantly lower than for the posttest. Effect sizes were medium to large in magnitude for understand research techniques (d = 0.68), understanding how research design may influence findings (d = 0.78), understand research design (d = 0.80) and troubleshoot issues (d = 1.17). There were no significant differences between pre- and posttest data for the target skill understanding how to search for information in the field.


3.2.2.3. Evaluating scientific information. The category evaluating scientific information includes six test items in the target skills of evaluating the strength of scientific claims, identifying information to evaluate scientific claims, and generating alternative explanations for complex scientific problems. Results from the paired t test revealed students’ posttest scores were significantly higher than those on the partnered pretest in the group of skills associated with evaluating scientific information (Table 4). The effect size for this category (d = 1.49) was large in magnitude. Post hoc pairwise comparisons of pre- and posttest data indicated students’ scores were statistically different for the target skills evaluate evidence and critique experimental designs, identify additional information needed to evaluate a hypothesis/interpretation, and provide alternative explanations for results that may have many causes (FDR-adjusted p values of 0.036, 0.027, and 0.005, respectively; Table 4). Analysis of these target skills revealed a large effect size for identify additional information needed to evaluate a hypothesis/interpretation (d = 0.94) and medium effect sizes for evaluate evidence and critique experimental design (d = 0.61) and provide alternative explanations for results that may have many causes (d = 0.69).
3.2.2.4. Interpreting, representing, and analyzing quantitative scientific data. Test data from 9 items in three skill areas (i.e. representing data in a visual form, reading and interpreting visual data representations, and understanding basic statistics) were analyzed for the large skill category of interpreting, representing, and analyzing quantitative scientific data (Table 5; MPP = 57). With three outlier students identified at the low end of the point scale removed to provide a more conservative estimate (z < −3), student posttest scores were statistically higher than those on the pretest (p < 0.001) with a large effect size (d = 1.29). Results indicated that student scores in the related skills of representing data in a visual form, interpreting visual representations of data, and understanding basic statistics were significantly higher on the posttest than the pretest with p values all less than 0.05. Examination of effect size indicated that the treatment effect was moderate in magnitude for represent data in a visual form (d = 0.59) and interpret visual representations of data (d = 0.65) and of small magnitude for understand basic statistics (d = 0.47).

3.3. Alignment of students’ self-ratings and performance data

The agreement between student performance data and self-skill ratings was examined using multiple data sources. Differences in students’ pre- and post-URE survey ratings using 5-point Likert-type scales were examined to evaluate self-identified skill changes. Scale ratings reflected the degree of student independence in task completion, which faculty feedback in the larger study regularly highlighted as a key indicator of student skill proficiency, where: 1 = No Experience, 2 = Not Comfortable in completing, 3 = Can complete with substantial assistance, 4 = Can complete independently, and 5 = Can instruct others how to complete (Maltese et al., in review). Overall, student data suggested the positive contributions of the research experience on skill development (Fig. S3 in ESI). Notably, students rated their abilities at the end of the URE at least one level higher than at the start across 4 of the 10 EPS and QL skills that align with PURE target skills, including: identify strengths and weaknesses of research design elements (67% of students), developing your own research plan (58%), using advanced techniques in the field (58%), and troubleshooting errors during data collection (58%). Review of the remaining skills revealed smaller numbers of students identifying positive changes: representing data in a visual form (38% of students), interpreting data visualizations (29%), interpreting and critiquing results and findings in the literature (29%), interpreting statistical analysis (25%), identifying further information necessary to support research-related results in the literature (25%), and conducting literature searches (25%). Students’ lower self-recognized gains in these less “hands-on” research-related areas may, at least in part, reflect variation in their research experiences (e.g., project-emphasized skills, mentor focus on the development of given skills) as well as perceived pre-existing competencies due to earlier exposure (e.g., data visualization in coursework), which may have “pegged out” the Likert-scale responses and limited the changes that could be identified.

Changes in students’ pre- and post-URE self-ratings and performance scores for the shared skills were examined to evaluate the extent of agreement between the indirect and direct measures. As seen in Table 6, the correlations between self-report and performance data for all skills were found to be weak to moderate (r < 0.40). To some degree, the low correlation between measures was anticipated for multiple reasons, including potential bias or inconsistency in self-rating one's abilities, the possibility that students’ perceived abilities in their respective research experiences may not fully align with the more generalized performance tasks, and student “losses” in the performance data. In addition, the use of non-equivalent scales across measures (i.e. 5-point Likert-type survey items v. performance scores across grouped target skill questions) may have affected the comparison.

Table 6 Correspondence between students’ performance and self-report data
PURE target skills Associated survey item Pre/post-change
r
Understand how to search for information in the field “Conducting searches for literature related to your project” −0.319
Understand research design “Developing your own research plan” 0.178
Use research techniques and instrumentation “Using advanced research techniques in your field” −0.25
Understand how research design may influence scientific findings “Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls)” 0.201
Troubleshoot technical issues in data collection “Troubleshooting theoretical/technical errors in research during data collection” 0.112
Represent data in a visual form “Representing data in a visual form common for the research field” 0.361
Interpret visual representations of data “Interpreting visual representations of data” −0.063
Understand basic statistics “Interpreting statistical analysis of research” 0.171
Evaluate evidence and critique experimental designs to evaluate hypotheses “Interpreting and critiquing results and findings presented in the literature” 0.193
Identify additional information needed to evaluate a hypothesis/interpretation “Identifying further information necessary to support research-related results in the literature” 0.184
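In principle, the r values in Table 6 can be reproduced by correlating each student's pre/post change in the Likert self-rating with their pre/post change in the summed PURE score for the matched skill. A minimal sketch for a single skill follows; the file layout and column names are hypothetical, and Pearson's r is assumed since the coefficient type is not specified in the table.

```python
# Sketch (hypothetical layout): agreement between self-report and performance
# change scores for one matched skill, as summarized in Table 6.
import pandas as pd
from scipy.stats import pearsonr

# Expected columns per student: survey_pre, survey_post (1-5 Likert ratings)
# and pure_pre, pure_post (summed rubric scores for the matched target skill).
df = pd.read_csv("matched_skill_scores.csv")

survey_change = df["survey_post"] - df["survey_pre"]
performance_change = df["pure_post"] - df["pure_pre"]

r, p = pearsonr(survey_change, performance_change)
print(f"r = {r:.3f}, p = {p:.3f}")
```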


4. Discussion

Preliminary evidence of the development of chemistry students’ STSs over a summer URE was obtained from changes in PURE test performance, as rubric scoring for each item was aligned to mapped levels of proficiency. Differences in student pre- and posttest performance varied by target skill focus as reported above. For performance-based assessments, differentiation in answer quality is essential to ensuring task and scoring criteria validity and reliability (Linn et al., 1991). Further, subtle differences were quantitatively and qualitatively observed in student data, which supports the utility of the PURE instrument for documenting individual changes in skills over time.

Longitudinal analysis of students’ baseline and post-URE responses revealed the idiosyncratic nature of skill development, and that, on average, students made individual improvements in ∼8 of the 11 measured target skills. These results align with the broad outcomes commonly outlined in prior literature examining URE skill contributions using self-report data (as reviewed in Laursen et al., 2010), and extend current knowledge by providing direct evidence of participant learning progressions. It should be noted that a smaller, but noteworthy, group of students per item (4% to 21%) scored lower on the posttest than the pretest. As it is unlikely that students “lost” knowledge during the 8 to 10 week URE period, we believe it is reasonable to assume that these results may reflect random effects such as a lack of task attentiveness or test fatigue. However, in the case of a subset of student responses, participants appeared to “over-transfer” newly acquired knowledge (e.g., technical practices) by attempting to apply or “fit” it to situations where it may not be appropriate. This notable discrepancy suggests that these students may not fully grasp this new knowledge – in particular, understanding the techniques they used in their research. As individual changes are not published for comparable instruments (e.g., Stein et al., 2007), it is recommended that future research and (particularly) faculty assessment of their students using PBAs attend to these infrequent “losses” in student performance, as they may lend insight into important gaps in student knowledge that may be obscured in group-level analysis and can be used to improve student learning.

In assessing the potential effects of URE participation on the development of students’ EPS and QL skills using the pilot PURE instrument, students demonstrated significant improvement over the course of the URE on the test overall (p < 0.001) and on 10 of the 11 target skills (p < 0.05) after controlling for multiple comparison effects. Target skill areas where significant gains were noted included: understand research design, understand research techniques, understand how research design may influence scientific findings, troubleshoot technical issues in data collection, represent data in a visual form, interpret visual data representations, understand basic statistics, evaluate evidence and critique experimental designs, identify additional information needed to evaluate a hypothesis/interpretation, and provide alternative explanations for results that have many causes (Tables 3–5). Examination of effect sizes revealed that gains in these target skill areas were medium to high in magnitude for the URE participants (d = 0.47 to 1.17). In contrast, the one area in which students did not demonstrate significant gains between pre- and posttest administrations was understanding how to search for literature in the field. This may reflect either that students’ abilities in this area did not improve over the URE or that the target skill was measured using a single item focused on identifying specific databases to obtain information rather than the process students would use to seek out relevant literature. This item will be refined in future iterations of the PURE instrument to gain better resolution of the cognitive processes (or steps) students use in seeking out information. While restraint should be used in making substantial claims, the overall patterns in individual performance demonstrated that students made gains in the targeted EPS and QL skills over their respective research experiences. In sum, these early findings suggest that URE participation assisted the development of students’ scientific thinking skills as measured by the PURE instrument.

An entrance and exit survey given to student participants included items asking them to rate their abilities in varying skills and activities associated with the research process. Changes in student self-ratings and performance data were compared across the 10 research skills shared by the surveys and the PURE instrument (see Table 6 for more details) to evaluate the degree of alignment between the direct and indirect data sources. Overall, student survey data suggested the effectiveness of the URE in contributing to skill development, as significant differences were reported for five of the skill areas with lower, but notable, changes in the remaining skills (Fig. S3 in ESI). Likewise, student performance scores for nine of the 10 partnered skills were statistically higher at the end of the experience, with the remaining skill (understanding how to search for literature in the field), again, demonstrating a lower degree of gains (Tables 3–5). However, when the extent of skill change is compared by data type, the relationship between self-report and performance data for most partnered skills was found to be weak (Table 6). As the measurement of cognitive skills is challenging, and given the uniqueness of each student's research experience, it is likely that one's performance, what one actually does in a real-world testing situation as a product of competence combined with individual [e.g., research activities, motivation] and system-based influences [e.g., provided mentorship], may not completely reflect one's competence or what s/he is capable of (Rethans et al., 2002). In other words, specific practical competencies gained in the student's URE may not fully transfer to the more generalized, writing-based PURE tasks. Thus, it can be recommended that the PURE instrument, as well as other similar PBAs, be used to complement other assessment strategies (e.g., surveys, observations, interviews) that capture data that may otherwise be inaccessible. It can also be suggested that the use of subfield-relevant situations within the PURE framework (tasks and rubrics) may permit a higher level of agreement between what students demonstrate and their perceived project-related abilities; however, this may decrease test generalizability.

5. Limitations

There appear to be four noteworthy limitations for this project. Initially, as with most exploratory studies, the early findings from the PURE instrument should be interpreted with caution as they cannot be generalized until data are collected from a larger sample and refinements are made to the initial instrument based on methodological investigations of the performance of each item. The initial results presented here are from a small sample (n = 24), which limits rigorous psychometric and quantitative analyses in the iterative process of assessment building as well as making inferences about students’ performance. Even so, given the heterogeneity of the student sample (e.g., academic standing, gender) and by exceeding the minimum threshold of participants typically needed to aptly assess an educational innovation using comparable performance instruments (e.g., the CAT instrument [Stein et al., 2007] requires n ≥ 15 subjects), it is reasonable to conclude that the data collected here are appropriate for such an exploratory study. In alignment with prior research using self-report data (e.g., Laursen et al., 2010), our preliminary results lend further insight into the effect of UR on student learning; however, it would be beneficial for future research to collect performance data from a greater number of participants to afford more rigorous characterizations of URE student learning.

Next, the PURE instrument is designed to be administered remotely without a proctor. Such an approach creates the opportunity for students to utilize external resources, which may weaken the reliability of the data collected here. However, given the voluntary, low-stakes nature of the test, which would be expected to decrease the likelihood of “cheating”, and the practical purpose of administering it to a national sample of students over multiple periods, the use of a remote tool was appropriate. In addition, given the autonomous completion of the test by students, it seems fair to expect that the data reported here were influenced by each student's level of self-motivation in providing thoughtful, complete responses, which may bias performance results (Shadle et al., 2012). While students were compensated for their efforts, it is likely that this extrinsic factor alone may not have provided sufficient motivation to compel students to take their time and “do their best”, and thus students with intrinsic motivations (e.g., taking enjoyment in challenging one's abilities in the field) may have been selected for. As such, it may be useful in future studies to administer online PBAs to a subsample of students in a proctored setting with limited distractions and a standardized time period in an effort to ensure student attentiveness for comparative purposes.

Further, students’ written and drawn (i.e. graphing) responses to performance tasks are just two of a broad variety of means (e.g., observations, verbal questioning) by which STSs can be demonstrated (Linn et al., 1991). While task responses offer a rich data source, it seems fair to expect that this approach is inherently limited by favoring students who are able to clearly articulate their scientific reasoning. The extent of some students’ scientific thinking skills may not be as evident for those who have difficulty writing, which may be particularly true for students underprepared in their educational training or who are novice researchers (Feldon et al., 2010). As well, it is likely that the measurement of scientific thinking skills may be influenced by how complete or explicative the student is in responding to the open-ended tasks. Taking that into account, expressing one's scientific thoughts in written form is a central means by which scientists share information, which suggests the value of such task responses as an appropriate means to measure student STSs (Timmerman et al., 2010).

Finally, data collection for this initial study was conducted with active URE participants and lacked a true comparison group. Thus, while student researchers acted as their own respective “controls” for assessing personal learning progressions by providing baseline and post-intervention data, future studies would benefit from a comparison group of students not actively engaged in undergraduate research who would be “matched” with their nearest-neighbor URE participant based on commonalities in home institution, academic standing, and coursework.

6. Conclusions

In light of the increased emphasis on undergraduate research as an element of college science curricula (e.g., PCAST, 2012), it is important that these experiences contribute to the critical thinking/problem solving that is required of practicing scientists and scientifically literate citizens (AAAS, 2011). The development of effective URE programs is dependent on understanding how these experiences benefit students, which calls for rigorous assessments of the extent to which students have mastered scientific thinking skills (Linn et al., 2015; NAS/NAE/NAM, 2017). The PURE instrument was designed and tested for the purpose of assessing chemistry URE students’ EPS and QL skills over time using contextually situated tasks (Harsh, 2016), though the developed framework may also have utility in other STEM disciplines as well as CURE courses. This study demonstrates the PURE instrument to be a valid and reliable assessment tool for providing insight into how skill proficiencies change throughout a research experience.

This early work provides several contributions to understanding how undergraduate research benefits students. Given the lack of direct measures for testing the development of research-related competencies, the differences documented in URE students’ answer quality over time support the viability of performance data for assessing research skill growth. In using performance data, the study offers direct evidence of the contributions of URE participation to the development of students’ specific scientific thinking skills. Longitudinal comparisons of student performance suggest that URE participation improved proficiencies in experimental problem solving and data analysis. Specifically, students demonstrated significant gains in 10 of the 11 targeted scientific thinking skills related to understanding methods of scientific inquiry, analyzing scientific data, and evaluating scientific information. As with most exploratory studies, the results here cannot be readily generalized until data are collected from a larger group of URE students; nevertheless, this work begins to fill a void in the literature by providing actual performance data on the effect of URE participation on specific research skills.
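Purely as an illustration of the kind of paired pre/post comparison summarized above (paired-sample t values are reported in the text, and the Benjamini and Hochberg (1995) procedure is one way to control for multiple skill-area comparisons), the following minimal sketch runs such an analysis on synthetic data; the scores, the scoring scale, and the use of an FDR adjustment are assumptions made for illustration only, not a description of the published analysis.

```python
# Illustrative sketch only: paired pre/post comparisons of per-skill scores with a
# Benjamini-Hochberg false discovery rate adjustment across skill areas. The data are
# synthetic, and the scoring scale and FDR adjustment are assumptions for illustration.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_students, n_skills = 24, 11

# Synthetic rubric-style scores (clipped to a 0-4 scale) at baseline and post-URE.
pre = np.clip(rng.normal(2.0, 0.6, size=(n_students, n_skills)), 0, 4)
post = np.clip(pre + rng.normal(0.5, 0.5, size=(n_students, n_skills)), 0, 4)

# Paired-sample t test for each skill area.
p_values = np.array([ttest_rel(post[:, k], pre[:, k]).pvalue for k in range(n_skills)])

# Benjamini-Hochberg step-up procedure at a false discovery rate of q = 0.05.
q = 0.05
order = np.argsort(p_values)                              # ranks of the p values, smallest first
thresholds = q * np.arange(1, n_skills + 1) / n_skills    # per-rank significance thresholds
below = p_values[order] <= thresholds
significant = np.zeros(n_skills, dtype=bool)
if below.any():
    cutoff = np.max(np.where(below)[0])                   # largest rank meeting its threshold
    significant[order[: cutoff + 1]] = True               # reject all hypotheses up to that rank

for k in range(n_skills):
    print(f"Skill {k + 1}: p = {p_values[k]:.4f}, significant after FDR control = {significant[k]}")
```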

Using performance data in this study yielded an alternative perspective on participant skill development in undergraduate research. The early findings presented here align with the existing self-report literature on the broad educational outcomes associated with URE participation (e.g., Lopatto, 2007; Laursen et al., 2010). While the continued use of self-reported data is needed to lend insight into information not accessible by other means, studies that use valid and generalizable performance assessments able to directly measure changes in participant learning are essential for rigorously evaluating the effectiveness of these increasingly common experiences. Thus, future mixed methods studies should, where possible, complement self-report indicators with performance data to provide a more comprehensive understanding of how these experiences benefit students. Such studies could characterize the types of gains students take away from UREs based on their background (e.g., academic standing), allowing for a greater understanding of the trajectories of skill development and of how to “best” support student learning in these experiences.

Results of performance assessments should not only prove useful to science faculty and administrators in demonstrating programmatic success through students’ skill improvement but also serve as a basis for URE refinement by identifying gaps in student proficiencies. As this exploratory work demonstrates that PBAs can provide reliable insights into the effect of UREs on science students’ skill development, it is hoped that this work will encourage science faculty and science education researchers to adopt performance data as an indicator of the effectiveness of research training in support of student learning. To support this, and similar to the emerging base of science concept inventories in the literature, the field would benefit from greater attention to the design and testing of valid, generalizable, and freely available measures for monitoring student progress in research-related skills. As faculty and departments respond to national calls to engage students in UR for the promotion of the STEM workforce and a scientifically literate citizenry, developing rigorous direct assessments of student learning through these interventions may play a formative role in improving science education.

Acknowledgements

Funding for the Undergraduate Scientists: Measuring Outcomes of Research Experiences project was in part provided by the NSF under Award #1140445. Any findings, opinions, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. We would like to thank Russell Balliet, Ryan Bouldin, Sarah Pettito, Stacy O’Reilly, Mikaela Schmitt-Harsh, Sara Skrabalak, Mary Walczak, Robert Sherwood, Heather Reynolds, Joshua Danish, and Jennifer Warner for their contributions to this study.

References

  1. Ackerman P. L. and Kanfer R., (2006), Test length and cognitive fatigue. Final report to the College Board, Atlanta, GA: Author.
  2. American Association for the Advancement of Science, (2011), Vision and change in undergraduate biology education: a call to action, Washington, DC: American Association for the Advancement of Science.
  3. American Chemical Society Committee on Professional Training (ACS), (2008), Development of Student Skills in a Chemistry Curriculum, accessed November 2012 from: http://portal.acs.org/portal/PublicWebSite/about/governance/committees/training/acsapproved/degreeprogram/CNBP_025490.
  4. Benjamini Y. and Hochberg Y., (1995), Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological), 57(1), 289–300.
  5. Bowman N., (2010), Can 1st-year college students accurately report their learning and development? American Educational Research Journal, 47(2), 466–496.
  6. Boyer Commission on Educating Undergraduates in the Research University (Boyer Commission), (2008), Reinventing undergraduate education: three years after the Boyer report, Stony Brook, NY: State University of New York-Stony Brook.
  7. Bransford J. D. and Schwartz D. L., (1999), Rethinking transfer: a simple proposal with multiple implications, Rev. Res. Educ., 24, 61–100.
  8. Cohen J., (1992), A power primer, Psychol. Bull., 112(1), 155.
  9. Cox M. and Andriot A., (2009), Mentor and undergraduate student comparisons of student's research skills, Journal of STEM Education, 10(1&2), 31–41.
  10. Crowe M. and Brakke D., (2008), Assessing the impact of undergraduate-research experiences on students: an overview of current literature, CUR Q., 28(1), 43–50.
  11. Dasgupta A. P., Anderson T. R. and Pelaez N., (2014), Development and validation of a rubric for diagnosing students’ experimental design knowledge and difficulties, CBE-Life Sci. Educ., 13(2), 265–284.
  12. Delatte N., (2004), Undergraduate summer research in structural engineering, J. Prof. Issues Eng. Educ. Pract., 130, 37–43.
  13. Feldon D. F., Maher M. A. and Timmerman B. E., (2010), Performance-based data in the study of STEM PhD education, Science, 329, 282–283.
  14. Gonyea R. M., (2005), Self-reported data in institutional research: review and recommendations, New Dir. Inst. Res., 127, 73.
  15. Gormally C., Brickman P. and Lutz M., (2012), Developing a test of scientific literacy skills (TOSLS): measuring undergraduates’ evaluation of scientific information and arguments, CBE-Life Sci. Educ., 11(4), 364–377.
  16. Harsh J. A., (2016), Designing performance-based measures to assess the scientific thinking skills of chemistry undergraduate researchers, Chem. Educ. Res. Pract., 17(4), 808–817.
  17. Harsh J. A., Maltese A. V. and Tai R. H., (2011), Undergraduate research experiences in chemistry and physics from a longitudinal perspective, J. Coll. Sci. Teach., 41(1), 84–91.
  18. Harsh J. A., Maltese A. V. and Tai R. H., (2012), A longitudinal perspective of gender differences in STEM undergraduate research experiences, J. Chem. Educ., 89, 1364–1370.
  19. Kardash C. M., (2000), Evaluation of an undergraduate research experience: perceptions of undergraduate interns and their faculty mentors, J. Educ. Psychol., 92, 191–201.
  20. Kirschner P. A., (1992), Epistemology, practical work and academic skills in science education, Sci. Educ., 1(3), 273–299.
  21. Laursen S., Seymour E., Hunter A. B., Thiry H. and Melton G., (2010), Undergraduate research in the sciences: engaging students in real science, San Francisco: Jossey-Bass.
  22. Linn R. L., Baker E. L. and Dunbar S. B., (1991), Complex, performance based assessment: expectations and validation criteria, Educ. Res., 20(8), 15–21.
  23. Linn M. C., Palmer E., Baranger A., Gerard E. and Stone E., (2015), Undergraduate research experiences: impacts and opportunities, Science, 347(6222), 1261757.
  24. Lopatto D., (2007), Undergraduate research experiences support science career decisions and active learning, Cell Biology Education, 6, 297–306.
  25. Lopatto D., (2010), Science in solution: the impact of undergraduate research on student learning, Washington, DC: CUR and Research Corporation for Scientific Advancement.
  26. Maltese A. V., Harsh J. A. and Jung E., (in review), Evaluating Undergraduate Research Experiences – Development of a Self-Report Tool.
  27. Mehrens W. A., (1992), Using performance assessment for accountability purposes, Educ. Meas.: Issues and Pract., 11(1), 3–9, 20.
  28. Merriam S. B., (2009), Qualitative research: A guide to design and implementation: Revised and expanded from qualitative research and case study applications in education, San Francisco: Jossey-Bass.
  29. Miller M. D. and Linn R. L., (2000), Validation of performance-based assessments, Appl. Psychol. Meas., 24, 367–378.
  30. National Academies of Sciences, Engineering, and Medicine, (2017), Undergraduate Research Experiences for STEM Students: Successes, Challenges, and Opportunities, Washington, DC: The National Academies Press, DOI: 10.17226/24622.
  31. National Council on Education and the Disciplines, (2001), Mathematics and Democracy, The Case for Quantitative Literacy, Washington, DC: The Woodrow Wilson National Fellowship Foundation.
  32. National Research Council (NRC) – Committee on Science, Engineering and Public Policy, (2005), Rising above the gathering storm: energizing and employing America for a brighter economic future, Washington, DC: National Academies Press.
  33. President's Council of Advisors on Science and Technology (PCAST), (2012), Engage to Excel: Producing One Million Additional College Graduates with Degrees in Science, Technology, Engineering, and Mathematics, Report to the President. Washington, DC: Executive Office of the President.
  34. Raker J. R. and Towns M. H., (2012), Designing undergraduate-level organic chemistry instructional problems: seven ideas from a problem-solving study of practicing synthetic organic chemists, Chem. Educ. Res. Pract., 13(3), 277–285.
  35. Rethans J. J., Norcini J. J., Baron-Maldonado M., Blackmore D., Jolly B. C., LaDuca T., Lew S., Page G. G. and Southgate L. H., (2002), The relationship between competence and performance: implications for assessing practice performance, Med. Educ., 36(10), 901–909.
  36. Rueckert L., (2008), Tools for the Assessment of Undergraduate Research Outcomes, in Miller R. L. and Rycek R. F. (ed.) Developing, Promoting and Sustaining the Undergraduate Research Experience in Psychology, Washington, DC: Society for the Teaching of Psychology, pp. 272–275.
  37. Sadler T. D. and McKinney L. L., (2008), Scientific research for undergraduate students: a review of the literature, J. Coll. Sci. Teach., 39(5), 68–74.
  38. Seymour E. L., Hunter A. B., Laursen S. and DeAntoni T., (2004), Establishing the benefits of research experiences for undergraduates: first findings from a three-year study, Sci. Educ., 88, 493–594.
  39. Shadle S. E., Brown E. C., Towns M. H. and Warner D. L., (2012), A rubric for assessing students’ experimental problem-solving ability, J. Chem. Educ., 89, 319–325.
  40. Slater T. F. and Ryan R. J., (1993), Laboratory performance assessment, Phys. Teach., 31(5), 306–308.
  41. Stein B., Haynes A. and Redding M., (2007), Project CAT: assessing critical thinking skills, in Deeds D. and Callen B. (ed.), Proceedings of the 2006 National STEM Assessment Conference, Springfield, MO: Drury University.
  42. Timmerman B., Strickland D., Johnson R. and Payne J., (2010), Development of a universal rubric for assessing undergraduates’ scientific reasoning skills using scientific writing, [Online]. University of South Carolina Scholar Commons http://scholarcommons.sc.edu/, accessed Aug 22, 2013.
  43. Weston T. J. and Laursen S. L., (2015), The undergraduate research student self-assessment (URSSA): validation for use in program evaluation, CBE-Life Sci. Educ., 14(3), ar33.
  44. Willison J., (2009), Multiple contexts, multiple outcomes, one conceptual framework for research skill development in the undergraduate curriculum, CUR Q., 29(3), 10–15.
  45. Zoller U., (2001), Alternative assessment as (critical) means of facilitating HOCS-promoting teaching and learning in chemistry education, Chem. Educ. Res. Pract., 2(1), 9–17.

Footnotes

† Electronic supplementary information (ESI) available. See DOI: 10.1039/c6rp00222f
‡ In light of the variety that can be seen in these experiences, for the purpose of this article, we use the Council of Undergraduate Research (CUR) definition of UR as “an inquiry or investigation conducted by an undergraduate that makes an original intellectual or creative contribution to the discipline” (Halstead, 1997, p. 1390).
§ These observations in UR data collection converge with a recent review by Linn et al. (2015).
¶ Please see Harsh (2016) in this Journal for a detailed description of the development, validation, and implementation of the PURE instrument.
|| It was anticipated that not all students would make gains over the URE for a subset of technique-specific items (such as Part A in Box 1), owing to differences in their respective research practices. While limited generalizability in testing is often a concern, having a small number of students act as a “comparison” group demonstrating little to no notable longitudinal change lends evidence to instrument validity and treatment effect.
** Because pairwise comparisons across the analyses yielded nearly identical estimates, and the data were normally distributed, only the paired-sample t values are reported in the text.

This journal is © The Royal Society of Chemistry 2017