Joseph Harsh,*a John J. Esteb b and Adam V. Maltese c

a James Madison University, Harrisonburg, Virginia 22807-0001, USA. E-mail: harshja@jmu.edu
b Butler University – Chemistry, Indianapolis, Indiana, USA
c Indiana University Bloomington, Bloomington, Indiana, USA
First published on 28th March 2017
National calls in science, technology, engineering, and mathematics (STEM) education reform efforts have advanced the wide-scale engagement of students in undergraduate research for the preparation of a workforce and citizenry able to attend to the challenges of the 21st century. Awareness of the potential benefits and costs of these experiences has led to an emerging literature base outlining gains in participants’ cognitive, affective, and conative domains to support the impact of undergraduate research for students of all backgrounds; however, the majority of this work has relied on self-report data, limiting inferences about the causal effects on student learning. As part of a larger project on apprentice-like undergraduate research experiences (UREs) in the physical sciences, the present exploratory study complemented indirect self-report data with direct performance data to assess the development of chemistry students’ scientific thinking skills over a research experience. Performance data were collected using the Performance assessment of Undergraduate Research Experiences (PURE) instrument, a validated tool designed to assess changes in chemistry students’ analytical and data-driven decision-making skills through open-response tasks situated in real-world problems from the primary literature. Twenty-four summer research students in chemistry (46% women; 50% 1st/2nd year students; 42% first-time URE participants) from seven colleges and universities provided baseline and post-intervention performance data. Differences in pre/post-response task correctness provided a direct measure of individual changes in student competencies. Early study findings indicate the positive contributions of UREs to students’ competencies in the areas of problem-solving, experimental design and the use of research techniques, data analysis and the interpretation of results, and the evaluation of primary literature.
Survey data were also collected on students’ self-skill ratings to allow comparisons between perceived and demonstrated competencies, which were found to be weakly correlated. This work begins to offer direct evidence to the effect of UREs on student learning progressions as well as the potential use of performance test data in evaluating the success of research training interventions designed to improve scientific thinking skills.
The need to rigorously assess changes in UR participants’ disciplinary research skills, which are critical outcomes in the preparation for STEM careers (National Research Council [NRC], 2005), has led to recent calls for generalizable performance-based assessments (PBAs) that provide direct evidence of student learning (Feldon et al., 2010; AAAS, 2011; Linn et al., 2015; NAS/NAE/NAM, 2017). Through authentic tasks that are valued in their own right, PBA requires the application of knowledge in the construction of an original response (Linn et al., 1991; Mehrens, 1992; Zoller, 2001). Given the “real-world” nature of the assessment, in which the individual directly demonstrates their knowledge in context, performance measures are often assumed to have higher validity than indirect measures (Linn et al., 1991; Mehrens, 1992; Miller and Linn, 2000; Feldon et al., 2010). Common examples of PBAs include scientific writing, oral presentations, oral and written tests of skills, and direct observation. However, while PBA is a long-standing (Linn et al., 1991) and widely endorsed assessment practice in K-16 science (Slater and Ryan, 1993; AAAS, 2011), the use of performance data to evaluate the development of postsecondary students’ research training has seen limited adoption (Feldon et al., 2010). This relative absence in undergraduate science may reflect challenges such as the lack of research competency standards in higher education, constraints on instructor time and resources, the inherent complexity of capturing skill development over time, the tendency for assessment of these skills to be overlooked, and the inherent difficulties involved in creating such measures (Rueckert, 2008; Feldon et al., 2010; Timmerman et al., 2010).
In response, and as a positive step forward, a growing base of vetted performance instruments from the literature (e.g., Gormally et al., 2012) and from evaluation groups (e.g., the Critical thinking Assessment Test [CAT] group at Tennessee Technological University; the James Madison University Center for Assessment and Research Studies [CARS]) is becoming available to faculty interested in assessing student learning in their classes.
Though the importance of UREs to student training in the sciences is without question, “it should not simply be assumed that a hands-on scientific task encourages the development of problem solving skills, reasoning ability, or more sophisticated mental models of the scientific phenomenon” (Linn et al., 1991, p. 19). Surprisingly, aside from discussions of potential means to assess UREs using performance data (Willison, 2009; Dasgupta et al., 2014), there is a paucity of direct evidence in the literature on the trajectories of research-related skill development for UR participants in the sciences (Crowe and Brakke, 2008; Linn et al., 2015). While validated instruments are available to measure students’ critical thinking or experimental problem-solving abilities, it can be argued that the majority of these tools are limited in evaluating the success of UREs on student learning due to their focus on non-science-specific competencies (e.g., Stein et al., 2007), topic-specific problems that may limit generalizability (e.g., Shadle et al., 2012), and closed-response designs (e.g., Gormally et al., 2012) that may not appropriately represent task authenticity or capture complex cognitive skills (Stein et al., 2007).
In this article, we describe the early findings of the Performance assessment of Undergraduate Research Experiences (PURE) instrument, a performance test designed to directly measure changes in chemistry students’ scientific thinking skills (STS) over a summer URE. Designed to complement self-report data as part of a larger national project on UREs, the PURE instrument is a validated measure consisting of 16 multipart open-ended tasks contextualized around real-world chemistry problems and associated criterion-based rubrics focused on experimental problem solving (EPS) and quantitative literacy (QL) skills. Specifically, this pilot study was guided by the research questions: (a) How do students’ EPS and QL skills change during a URE? and (b) What is the level of agreement between URE students’ skill ratings and demonstrated performance? It was anticipated that skill changes for students would be discernible at the end of their respective UREs, and that there would be moderate overlap in students’ self-report and performance data.
While the need for tools to directly measure URE student gains has been widely noted in the literature (Linn et al., 2015), to the best of our knowledge, this is the first study dedicated to assessing the development of participants’ disciplinary (chemistry) research skills. We feel this work advances our understanding of URE gains and may guide faculty, administrators, and other STEM researchers interested in investigating the efficacy and impact of UR. Similar to other educational innovations, accurately measuring students’ learning progressions over a research experience is essential for making decisions about programmatic refinement in support of learning outcomes (Kardash, 2000; Delatte, 2004).
The PURE instrument was designed to characterize the effect of UR on 11 scientific thinking skills (STSs) in the areas of experimental problem solving and quantitative literacy (see Table 1). As defined by the American Chemical Society (ACS, 2008), experimental problem solving (EPS) includes the ability to “define a problem clearly, develop testable hypotheses, design and execute experiments, analyze data, and draw appropriate conclusions” (p. 1). Quantitative literacy (QL), as defined by the National Council on Education and the Disciplines (NCED, 2001), is a habit of mind: the knowledge and skills necessary to effectively engage in daily and scientific quantitative situations (e.g., reading visual data, understanding basic statistics presented in a study). These skills were selected because they are (a) highly valued by science faculty and reform organizations for scientific literacy (AAAS, 2011; Gormally et al., 2012), (b) reflective of authentic practices used by practicing scientists (e.g., Kirschner, 1992; ACS, 2008), and (c) among the gains commonly reported in the URE literature (e.g., Laursen et al., 2010) that were specifically focused on in the survey data of the larger project.
PURE target skills | Explanation of skill | Associated survey item |
---|---|---|
Understand methods of inquiry that lead to scientific knowledge | ||
Understand how to search for information | Identify databases to search for information in the field relevant to a problem | “Conducting searches for literature related to your project” |
Understand research design | Identify a research design to experimentally address a scientific problem | “Developing your own research plan” |
Understand research techniques/instrumentation | Identify an instrument-based strategy to address a chemical problem | “Using advanced research techniques in your field” |
Understand how research design may influence scientific findings | Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) | Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) |
Troubleshoot technical issues | Evaluate a scientific problem to identify possible technical causes | “Troubleshooting theoretical/technical errors in research during data collection” |
Interpret, represent, and analyze quantitative scientific data | ||
Represent data in a visual form | Convert relevant information into an appropriate visual representation given the type of data | “Representing data in a visual form common for the research field” |
Interpret visual representations of data | Interpret or explain information presented in visual forms | “Interpreting visual representations of data” |
Understand basic statistics | Understand the need of basic statistics to quantify uncertainty in data or draw conclusions | “Interpreting statistical analysis of research” |
Evaluating scientific information | ||
Evaluate evidence and critique experimental designs | Understand the limits of correlational data and experimental design elements | “Interpreting and critiquing results and findings presented in the literature” |
Identify additional information needed to evaluate a hypothesis/interpretation | Explain how new information may contribute to the evaluation of a problem | “Identifying further information necessary to support research-related results in the literature” |
Provide alternative explanations for results that may have many causes | Recognize and explain possible alternative interpretations for given data or observations | N/A |
Box 1 Representative question and scoring criteria from the case study focusing on the effects of the pesticide atrazine on the sexual development of male frogs. |
In an effort to make the PURE tasks as generalizable as possible, the instrument was developed based on: (a) feedback from URE faculty mentors (n = 18) about the research practices and skills their students commonly engage in or are expected to develop, (b) prior research on URE outcomes, (c) test items that permitted a range of responses, and (d) the use of multiple tasks to measure each target skill to improve test generalizability (Linn et al., 1991). Information was also obtained using a short set of supplemental questions on student views of the PURE tasks (e.g., topical familiarity, recommendations for future test drafts) and relevant experiences that may have influenced their performance (e.g., technical practices).
The PURE test was administered to students at the start and end of their research experiences using Qualtrics survey software to provide baseline and post-intervention data for assessing changes in response quality. Students were instructed to answer the questions independently within a one-week period without the use of outside resources, and, if possible, to complete all tasks in one sitting. To submit the hand-drawn graphical data representations, students were asked to send the images electronically (e.g., high-quality photos from a camera or smartphone, or a computer scan) or via mail. IRB approval was obtained in advance of the study from Indiana University.
Target skill | Point-wise change (±) between pre- and posttest | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
−2 | −1 | 0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | +8 | +9 | +10 | |
Understand how to search for information | 16 | 5 | 3 | ||||||||||
Understand research design | 2 | 1 | 6 | 5 | 3 | 4 | 1 | 2 | |||||
Use (or understanding when to use) research techniques and instrumentation | 1 | 2 | 6 | 3 | 4 | 3 | 3 | 2 | 2 | ||||
Understand how research design may influence scientific findings | 1 | 2 | 4 | 2 | 2 | 2 | 4 | 1 | 3 | 2 | 1 | ||
Identify or troubleshoot technical issues in data collection | 1 | 4 | 4 | 4 | 7 | 2 | 2 | 1 | |||||
Evaluate evidence and critique experimental design to evaluate hypotheses | 1 | 1 | 5 | 6 | 7 | 2 | 1 | 1 | |||||
Identify additional information needed to evaluate a hypothesis/interpretation | 3 | 3 | 2 | 4 | 3 | 2 | 4 | 2 | 1 | ||||
Provide alternative explanations for results that may have many causes | 1 | 7 | 3 | 4 | 3 | 2 | 3 | 1 | |||||
Represent data in a visual form | 2 | 5 | 3 | 1 | 7 | 3 | 1 | 1 | 1 | ||||
Read and interpret visual representations of data | 2 | 2 | 5 | 7 | 2 | 4 | 2 | 1 | |||||
Understand basic statistics | 2 | 13 | 4 | 2 | 1 |
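The per-skill distributions in the table above can be produced by tallying each student's pre/post score difference. A minimal sketch, assuming scores are held per student in simple dictionaries (the data shapes and values here are illustrative, not the study's):

```python
# Sketch (assumed data shapes, not the authors' code): tally point-wise
# pre/post score changes for one target skill, as summarized in Table 2.
from collections import Counter

def change_distribution(pre_scores, post_scores):
    """pre_scores/post_scores: dicts mapping student id -> score for one skill."""
    diffs = [post_scores[s] - pre_scores[s] for s in pre_scores]
    return Counter(diffs)  # maps point-wise change -> number of students

# Synthetic scores for four students:
pre = {"s1": 3, "s2": 5, "s3": 2, "s4": 4}
post = {"s1": 5, "s2": 5, "s3": 4, "s4": 3}
print(sorted(change_distribution(pre, post).items()))  # -> [(-1, 1), (0, 1), (2, 2)]
```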
As is to be expected given the likely idiosyncrasies in students’ research experiences, the extent of pointwise gains per target skill and item, if any, was found to be highly variable. As a specific example, for the task identify and explain an instrument-based strategy to address a chemical problem for the ATZ case study, which was used in measuring the broader skill area of using research techniques and instrumentation (Table 2), 7 students (29%) scored one point higher on the posttest, 3 (13%) two points higher, 3 (13%) three points higher, 1 (4%) four points higher, 1 (4%) five points higher, 2 (8%) demonstrated no change, 1 (4%) scored one point lower, and 4 (17%) scored three points lower (see Box 1 for the test item and associated scoring rubric). To qualitatively demonstrate longitudinal changes in students’ pre- and posttest data, representative answers for this example task (with scored levels of proficiency) are presented in Boxes 2 and 3.
Box 2 Representative answer from a student (Student A, new-URE participant, senior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a one point (level) gain. |
Box 3 Representative answer from a student (Student B, new-URE participant, junior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a multipoint (level) gain. |
In this particular task, students needed to clearly demonstrate an understanding of the use of an instrumentation-based strategy to solve a real-world chemistry problem by addressing the relevant features of the technique. It is important to point out that student responses were evaluated on why they identified a given strategy, independent of the “correctness” of the selected technical approach. Student A provided a set of answers that demonstrated a small level of change by presenting a more focused justification of her strategy on the posttest; however, no detail is provided as to how the approach would be used to solve the problem in terms of compound identification. In comparison, Student B's set of answers demonstrated a higher level of change, as technical errors on the use of instrumentation (i.e. mass spectrometry alone is not an appropriate tool to determine the relative components of a mixture) are included in the pretest and are corrected with a more complete justification or explanation for the strategy in the posttest response. Such differences may reflect a number of student factors (e.g., research activities, prior knowledge) and suggest the idiosyncratic nature of skill development in these experiences. While it can be argued that the higher level of sophistication between Student B's pre- and posttest reasoning and technological knowledge may be attributed to his potential use of this strategy over the URE, the ability to appropriately transfer new information to a new situation does support that learning occurred for the participant (Bransford and Schwartz, 1999).
Interestingly, as seen in Table 2, a small proportion of students (between 4% and 21% per item) scored lower on the posttest than on the pretest for several items. Most commonly, in the case of these “losses”, students scored one point lower on the posttest than the pretest. As it is not likely that students regressed in their capabilities over a short 8 to 10 week URE period, the results suggest that these differences may often reflect random effects such as being less attentive in task completion, as represented in Box 4.
In this instance, Student C simply provided less detail in his justification for how the identified strategy would help solve the problem (i.e. compound detection) on the posttest than on the pretest. In comparison, however, it was noted that a subset of students attempted to “over-transfer” newly acquired technical knowledge to novel situations where it may not be appropriate. See Box 5 for a representative example, in which Student D shifted his pretest response from the “best” analytical technique (gas chromatography-mass spectrometry [GCMS]) for testing the presence and concentration of different compounds in a mixture (i.e. atrazine in lake water) to a less appropriate technique for the conclusive identification of a compound (ultraviolet-visible spectroscopy [UV-Vis]), which the student used in his work.
Box 4 Representative answer from a student (Student C, new-URE participant, senior) to the explanation of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a one point (level) “loss”. |
Box 5 Representative answer from a student (Student D, new-URE participant, senior) in the identification of a strategy to test the presence and concentration of atrazine in water samples (Q.1) demonstrating a shift from the “best” instrument to a less appropriate one. |
These occasional instances of students trying to leverage new knowledge inappropriately (∼9% of total responses across technique-focused items) suggest that they may not fully understand the newly learned concepts and procedures. While infrequent in the data, direct evidence of such gaps in student proficiencies is of particular value for providing insight into how learning can be improved.
Group-level pairwise comparisons between test administrations revealed that posttest scores (M = 92 [maximum of 146 points], SD = 12) were significantly higher than those on the pretest (M = 72, SD = 9), t(23) = 11.23, Z = 3.72, p < 0.001 (Fig. 1). The estimated effect size was d = 2.65, indicating that the treatment effect was of large magnitude (d > 0.8) for contributing to the development of participants’ STSs (Cohen, 1992).
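The paired comparison and effect size above can be sketched as follows. The scores are synthetic stand-ins, not the study's data, and Cohen's d is computed here under one common convention for paired samples (mean difference divided by the standard deviation of the differences); the paper does not state which convention was used.

```python
# Illustrative sketch: paired t statistic and a paired-samples Cohen's d
# (mean difference / SD of the differences) from matched score lists.
import math

def paired_t_and_d(pre, post):
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((x - mean_diff) ** 2 for x in diffs) / (n - 1)
    sd_diff = math.sqrt(var_diff)
    t = mean_diff / (sd_diff / math.sqrt(n))  # df = n - 1
    cohens_d = mean_diff / sd_diff            # standardized mean change
    return t, n - 1, cohens_d

# Synthetic pre/post totals for eight students (illustrative only):
pre = [70, 65, 80, 72, 68, 75, 71, 77]
post = [85, 82, 93, 88, 80, 91, 86, 90]
t, df, d = paired_t_and_d(pre, post)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```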
Fig. 1 Comparison of students’ (n = 24) total score on the PURE Instrument with data conceptualization questions by test administration. Error bars represent 95% CIs around the means. Maximum 146 points possible. Figure replicated from Harsh (2016) with permission. |
Analysis of the mean differences in participants’ pre/posttest scores is first described by large target skill category (i.e. understanding methods of scientific inquiry, analyzing scientific data, and evaluating scientific information), and then by the nested EPS and QL target skills (Tables 3–5). Significance tests were adjusted for multiple comparisons at the 0.05 level using Benjamini and Hochberg's False Discovery Rate procedure (Benjamini and Hochberg, 1995).
Target skill category | Points possible | Pretest meanc (SD) | Posttest meanc (SD) | Prob. of difference (t-test) | p valuea | FDR adjusted p valueb | Effect size (Cohen's d) |
---|---|---|---|---|---|---|---|
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 23 students. | |||||||
Understand how to search for information in the field | 3 | 1.75 (1.17) | 1.98 (1.05) | 1.06 | 0.302 | 0.050 | 0.21 |
Understand research design | 11 | 5.25 (2.48) | 7.52 (1.86) | 4.90 | 0.0001 | 0.005 | 0.80 |
Understand research techniques | 15 | 4.43 (2.94) | 6.7 (2.83) | 3.47 | 0.0001 | 0.005 | 0.68 |
Understand how research design may influence scientific findings | 15 | 6.46 (2.74) | 9.45 (2.67) | 3.94 | 0.001 | 0.027 | 0.78 |
Troubleshoot technical issues in data collection | 16 | 2.83 (2.04) | 4.94 (2.48) | 5.69 | 0.0001 | 0.005 | 1.17 |
Total for large skill category: understanding methods of scientific inquiry | 60 | 19.86 (5.70) | 30.05 (5.97) | 8.65 | 0.0001 | — | 1.66 |
Target skill category | Points possible | Pretest meanc (SD) | Posttest meanc (SD) | Prob. of difference (t-test) | p valuea | FDR adjusted p valueb | Effect size (Cohen d) |
---|---|---|---|---|---|---|---|
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 24 students. | |||||||
Evaluate evidence and critique experimental designs | 11 | 6.79 (1.93) | 7.98 (1.83) | 3.00 | 0.006 | 0.036 | 0.61 |
Identify additional information needed to evaluate a hypothesis/interpretation | 12 | 4.56 (1.85) | 7.17 (3.11) | 4.61 | 0.001 | 0.027 | 0.94 |
Provide alternative explanations for results or relationships that may have many causes | 6 | 1.70 (1.52) | 3.37 (1.56) | 3.41 | 0.0001 | 0.005 | 0.69 |
Total large skill category: evaluating scientific information | 29 | 13.1 (2.53) | 18.56 (4.13) | 7.10 | 0.0001 | — | 1.49 |
Target skill category | Points possible | Pretest meanc (SD) | Posttest meanc (SD) | Prob. of difference (t-test) | p valuea | FDR adjusted p valueb | Effect size (Cohen's d) |
---|---|---|---|---|---|---|---|
a p value at 95% CI. b Target skill p values adjusted using the BH FDR procedure to control for multiplicity. c n = 21. | |||||||
Represent data in a visual form | 40 | 27.24 (3.28) | 30.69 (3.19) | 5.61 | 0.0001 | 0.005 | 0.59 |
Read and interpret visual representations of data | 12 | 6.13 (1.56) | 7.02 (2.37) | 3.17 | 0.036 | 0.045 | 0.65 |
Understand basic statistics | 5 | 3.25 (1.59) | 3.71 (1.76) | 2.31 | 0.031 | 0.041 | 0.47 |
Total for large skill category: interpret, represent, and analyze quantitative data | 57 | 36.65 (2.5) | 41.23 (4.16) | 5.72 | 0.0001 | — | 1.29
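The multiplicity adjustment applied across Tables 3–5 can be illustrated with a small sketch of the Benjamini–Hochberg step-up rule. This is an assumed re-implementation for illustration, not the authors' code, and it assumes independent or positively dependent tests:

```python
# Sketch of the Benjamini-Hochberg FDR procedure: given raw p values,
# return which hypotheses are rejected at FDR level q by the step-up
# rule p_(i) <= (i / m) * q on the sorted p values.
def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda pair: pair[1])
    cutoff = 0  # largest rank whose p value passes the step-up criterion
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for rank, (idx, _) in enumerate(indexed, start=1):
        if rank <= cutoff:
            reject[idx] = True
    return reject

# Example with raw p values like those in Table 3:
print(benjamini_hochberg([0.302, 0.0001, 0.0001, 0.001, 0.0001]))
# -> [False, True, True, True, True]
```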
For the five scientific inquiry-related target skill areas (Table 3), post hoc pairwise comparisons revealed that the pretest scores for the target skills of understand research design, understand research techniques to solve a problem, understand how research design may influence scientific findings, and troubleshoot issues in data collection were significantly lower than the posttest scores. Effect sizes were medium to large in magnitude for understand research techniques (d = 0.68), understand how research design may influence findings (d = 0.78), understand research design (d = 0.80), and troubleshoot issues (d = 1.17). There were no significant differences between pre- and posttest data for the target skill understand how to search for information in the field.
Changes in students’ pre- and post-URE self-ratings and performance scores for the shared skills were examined to evaluate the extent of agreement between the indirect and direct measures. As seen in Table 6, the correlations between self-report and performance data for all skills were found to be moderate to weak (r < 0.40). To some degree, the low correlation between measures was anticipated for multiple reasons, including potential bias or inconsistency in self-rating one's abilities, the possibility that students’ perceived abilities in their respective research experience may not fully align with the more generalized performance tasks, a lack of alignment between test tasks and target skills, and student “losses” in the performance data. In addition, the use of non-equivalent scales across measures (i.e. 5-point Likert-type survey items v. performance scores across grouped target skill questions) may have affected the comparison.
PURE target skills | Associated survey item | Pre/post-change (r) |
---|---|---|
Understand how to search for information in the field | “Conducting searches for literature related to your project” | −0.319 |
Understand research design | “Developing your own research plan” | 0.178 |
Use research techniques and instrumentation | “Using advanced research techniques in your field” | −0.25 |
Understand how research design may influence scientific findings | Identify strengths and weaknesses of research design elements (e.g., potential sources of error, variables, experimental controls) | 0.201 |
Troubleshoot technical issues in data collection | “Troubleshooting theoretical/technical errors in research during data collection” | 0.112 |
Represent data in a visual form | “Representing data in a visual form common for the research field” | 0.361 |
Interpret visual representations of data | “Interpreting visual representations of data” | −0.063 |
Understand basic statistics | “Interpreting statistical analysis of research” | 0.171 |
Evaluate evidence and critique experimental designs to evaluate hypotheses | “Interpreting and critiquing results and findings presented in the literature” | 0.193 |
Identify additional information needed to evaluate a hypothesis/interpretation | “Identifying further information necessary to support research-related results in the literature” | 0.184 |
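Correlations like those in the table above can be computed with a standard Pearson r over students' paired change scores. A self-contained sketch with synthetic values (not the study's data):

```python
# Sketch: Pearson correlation between per-student changes in self-ratings
# and changes in performance scores (synthetic values for illustration).
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Change in self-rating (1-5 Likert) vs. change in performance points:
rating_change = [1, 0, 2, 1, 0, 1, 2, 0]
score_change = [3, 1, 2, 5, 0, 4, 1, 2]
print(round(pearson_r(rating_change, score_change), 3))
```

A weak r here, as in Table 6, indicates that students who reported the largest gains were not necessarily those who demonstrated them.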
Longitudinal analysis of students’ baseline and post-URE responses revealed the idiosyncratic nature of skill development and that, on average, students made individual improvements in ∼8 of the 11 measured target skills. These results align with the broad outcomes commonly outlined in prior literature examining URE skill contributions using self-report data (as reviewed in Laursen et al., 2010), and extend current knowledge by providing direct evidence of participant learning progressions. It should be noted that a smaller, but noteworthy, group of students per item (4% to 21%) scored lower on the posttest than the pretest. As it is unlikely that students “lost” knowledge during the 8 to 10 week URE period, we believe it is reasonable to assume that these results may reflect random effects such as a lack of task attentiveness or test fatigue. However, for a subset of student responses, participants appeared to “over-transfer” newly acquired knowledge (e.g., technical practices) by attempting to apply or “fit” it to situations where it may not be appropriate. This notable discrepancy suggests that these students may not fully grasp this new knowledge – in particular, understanding the techniques they used in their research. As individual changes are not published for comparable instruments (e.g., Stein et al., 2007), it is recommended that future research and (particularly) faculty assessing their students with PBAs attend to these infrequent “losses” in student performance, as they may lend insight into important gaps in student knowledge that can be obscured in group-level analysis and used to improve student learning.
In assessing the potential effects of URE participation on the development of students’ EPS and QL skills using the pilot PURE instrument, students demonstrated significant improvement over the course of the URE on the test overall (p < 0.001) and on 10 of the 11 target skills (p < 0.05) after controlling for multiple comparison effects. Target skill areas where significant gains were noted included: understand research design, understand research techniques, understand how research design may influence scientific findings, troubleshoot technical issues in data collection, represent data in a visual form, interpret visual data representations, understand basic statistics, evaluate evidence and critique experimental designs, identify additional information needed to evaluate a hypothesis/interpretation, and provide alternative explanations for results that have many causes (Tables 3–5). Examination of effect sizes revealed that gains in these target skill areas were medium to high in magnitude for the URE participants (d = 0.47 to 1.17). In contrast, the one area in which students did not demonstrate significant gains between pre- and posttest administrations was understanding how to search for literature in the field. This may reflect either that students’ abilities in this area did not improve over the URE or that the target skill was measured using a single item focused on identifying specific databases to obtain information rather than the process students would use to seek out relevant literature. This item will be refined in future iterations of the PURE instrument to gain better resolution of the cognitive processes (or steps) students use in seeking out information. While restraint should be used in making substantial claims, the overall patterns in individual performance demonstrated that students made gains in the targeted EPS and QL skills over their respective research experiences.
In sum, these early findings suggest that URE participation assisted the development of students’ scientific thinking skills as measured by the PURE instrument.
An entrance and exit survey given to student participants included items asking them to rate their abilities in varying skills and activities associated with the research process. Changes in student self-ratings and performance data were compared across 10 research skills shared between the surveys and the PURE instrument (see Table 6 for more details) to evaluate the degree of alignment between the direct and indirect data sources. Overall, student survey data suggested the effectiveness of the URE in contributing to skill development, as significant differences were reported for five of the skill areas with lower, but notable, changes in the remaining skills (Fig. S3 in ESI†). Likewise, student performance scores for 10 of the partnered skills were statistically higher at the end of the experience, with the remaining skill (understanding how to search for literature in the field), again, demonstrating a lower degree of gains (Tables 3–5). However, when the extent of skill change is compared by data type, the relationship between self-report and performance data for most partnered skills was found to be weak (Table 6). As the measurement of cognitive skills is challenging, and given the uniqueness of each student's research experience, it is likely that one's performance (what one actually does in a real-world testing situation, as a product of competence combined with individual [e.g., research activities, motivation] and system-based [e.g., provided mentorship] influences) may not completely reflect one's competence, or what one is capable of (Rethans et al., 2002). In other words, specific practical competencies gained in the student's URE may not be fully transferred to the more generalized, writing-based PURE tasks. Thus, it can be recommended that the PURE instrument, as well as other similar PBAs, be used to complement other assessment strategies (e.g., surveys, observations, interviews) that capture data that may be otherwise inaccessible.
It can also be suggested that the use of subfield-relevant situations within the PURE framework (tasks and rubrics) may permit a higher level of agreement between what students demonstrate and their perceived project-related abilities; however, this may reduce the generalizability of the test.
Next, the PURE instrument is designed to be administered remotely without a proctor. Such an approach creates the opportunity for students to use external resources, which may weaken the reliability of the data collected here. However, given the voluntary, low-stakes nature of the test, which would be expected to decrease the likelihood of “cheating”, and the practical need to administer it to a national sample of students over multiple time periods, the use of a remote tool was appropriate. In addition, because students completed the test autonomously, it seems fair to expect that the data reported here were influenced by each student's level of self-motivation in providing thoughtful, complete responses, which may bias performance results (Shadle et al., 2012). While students were compensated for their efforts, this extrinsic factor alone may not have provided sufficient motivation to compel students to take their time and “do their best”; thus, students with intrinsic motivations (e.g., taking enjoyment in challenging one's abilities in the field) may have been selected for. As such, it may be useful in future studies to administer online PBAs to a subsample of students in a proctored setting with limited distractions and a standardized testing period to ensure student attentiveness for comparative purposes.
Further, students’ written and drawn (i.e., graphed) responses to performance tasks are just two of a broad variety of means (e.g., observations, verbal questioning) by which STSs can be demonstrated (Linn et al., 1991). While task responses offer a rich data source, this approach is inherently limited in that it favors students who can clearly articulate their scientific reasoning. The extent of some students’ scientific thinking skills may be less evident for those who have difficulty writing, which may be particularly true for students underprepared in their educational training or for novice researchers (Feldon et al., 2010). Likewise, the measurement of scientific thinking skills may be influenced by how complete or explicit a student is in responding to the open-ended tasks. That said, expressing one's scientific thoughts in written form is considered central to how scientists share information, which suggests the value of such task responses as an appropriate means to measure student STSs (Timmerman et al., 2010).
Finally, data collection for this initial study was conducted with active URE participants and lacked a true comparison group. Thus, while student researchers acted as their own respective “controls” for assessing personal learning progressions by providing baseline and post-intervention data, future studies would benefit from a comparison group of students not actively engaged in undergraduate research, each “matched” with their nearest-neighbor URE participant based on commonalities in home institution, academic standing, and coursework.
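The matched-comparison design suggested above can be sketched in code. The following Python fragment is a hypothetical illustration, not part of the study: it pairs a URE participant with the non-participant who shares the same home institution and academic standing and has the greatest overlap in completed coursework. All records, field names, and course labels are invented.

```python
# Illustrative only: invented records, not data from this study.

def match_control(participant, pool):
    """Return the best-matched non-participant, or None if no candidate
    shares the participant's institution and academic standing."""
    candidates = [
        c for c in pool
        if c["institution"] == participant["institution"]
        and c["standing"] == participant["standing"]
    ]
    if not candidates:
        return None
    # Nearest neighbor = greatest overlap in completed coursework.
    return max(
        candidates,
        key=lambda c: len(set(c["courses"]) & set(participant["courses"])),
    )

ure_student = {"institution": "U1", "standing": "sophomore",
               "courses": ["GenChem2", "OChem1", "Calc2"]}
pool = [
    {"id": "A", "institution": "U1", "standing": "sophomore",
     "courses": ["GenChem2", "OChem1"]},
    {"id": "B", "institution": "U1", "standing": "senior",
     "courses": ["GenChem2", "OChem1", "Calc2"]},
    {"id": "C", "institution": "U2", "standing": "sophomore",
     "courses": ["GenChem2"]},
]
print(match_control(ure_student, pool)["id"])  # → A
```

Note that candidate B has identical coursework but is excluded by the exact-match constraints on institution and standing; in practice, the choice of which covariates to match exactly versus approximately is a design decision for the study team.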
This early work provides several contributions to understanding how undergraduate research benefits students. Given the lack of direct measures for testing the development of research-related competencies, the differences documented in URE students’ answer quality over time support the viability of performance data for assessing research skill growth. By using performance data, the study offers some direct evidence of the contributions of URE participation to the development of students’ specific scientific thinking skills. Longitudinal comparisons of student performance suggest that URE participation improved students’ proficiencies in experimental problem solving and data analysis. Specifically, students demonstrated significant gains in 10 of the 11 targeted scientific thinking skills related to understanding methods of scientific inquiry, analyzing scientific data, and evaluating scientific information. As with most exploratory studies, the results here cannot be readily generalized until data are collected from a larger group of URE students; however, this work begins to fill a void in the literature by providing actual performance data on the effect of URE participation on specific research skills.
Using performance data in this study provided an alternative perspective on participant skill development in undergraduate research. The early findings presented here align with the existing self-report literature on the broad educational outcomes associated with URE participation (e.g., Lopatto, 2007; Laursen et al., 2010). While the continued use of self-reported data is needed to lend insight into information not accessible by other means, studies that use valid and generalizable performance assessments, which can directly measure changes in participant learning, are essential to rigorously evaluate the effectiveness of these increasingly common experiences. Thus, it can be suggested that future mixed-methods studies should, where possible, complement self-report indicators with performance data to provide a more comprehensive understanding of how these experiences benefit students. Such studies could define the types of gains students take away from UREs based on their backgrounds (e.g., academic standing), allowing for a greater understanding of the trajectories of skill development and how to “best” support student learning in these experiences.
Results of performance assessments should not only prove useful to science faculty and administrators in demonstrating programmatic success through students’ skill improvement but also act as a basis for URE refinement by identifying gaps in student proficiencies. As this exploratory work demonstrates that PBAs can provide reliable insights into, and evidence of, the effect of UREs on science students’ skill development, it is hoped that this work will encourage science faculty and science education researchers to move forward in the use of performance data as an indicator of the effectiveness of research training in support of student learning. To support this, similar to the emerging base of science concept inventories in the literature, the field would benefit from greater attention to the design and testing of valid, generalizable, and freely available measures for monitoring student progress in research-related skills. As faculty and departments respond to national calls to engage students in UR to promote the STEM workforce and a scientifically literate citizenry, developing rigorous direct assessments of student learning through these interventions may have a formative role in improving science education.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c6rp00222f
‡ In light of the variety that can be seen in these experiences, for the purposes of this article we use the Council on Undergraduate Research (CUR) definition of UR as “an inquiry or investigation conducted by an undergraduate that makes an original intellectual or creative contribution to the discipline” (Halstead, 1997, p. 1390).
§ These observations in UR data collection converge with a recent review by Linn et al. (2015).
¶ Please see Harsh (2016) in this Journal for a detailed description of the development, validation, and implementation of the PURE instrument.
|| It was anticipated that not all students would make gains over the URE for a subset of technique-specific items (such as Part A in Box 1) due to their respective research practices. While limitations of generalizability in testing are often a concern, having a small number of students act as a “comparison” group demonstrating little to no notable longitudinal change lends evidence to instrument validity and treatment effect.
** Given the near-identical estimates between pairwise comparisons across the analyses, only the paired-sample t values are reported in text, given the normality of the data.
This journal is © The Royal Society of Chemistry 2017