Max R. Helix a, Laleh E. Coté ab, Christiane N. Stachl c, Marcia C. Linn ad, Elisa M. Stone e and Anne M. Baranger *ac
aGraduate Group in Science and Mathematics Education, University of California, Berkeley, CA, USA. E-mail: abaranger@berkeley.edu
bWorkforce Development & Education, Lawrence Berkeley National Laboratory, CA, USA
cDepartment of Chemistry, University of California, Berkeley, CA, USA
dGraduate School of Education, University of California, Berkeley, CA, USA
eCalTeach Program, University of California, Berkeley, CA, USA
First published on 12th November 2021
Understanding the impact of undergraduate research experiences (UREs) and course-based undergraduate research experiences (CUREs) is crucial as universities debate the value of allocating scarce resources to these activities. We report on the Berkeley Undergraduate Research Evaluation Tools (BURET), designed to assess the learning outcomes of UREs and CUREs in chemistry and other sciences. To validate the tools, we administered BURET to 70 undergraduate students in the College of Chemistry and 19 students from other STEM fields, comparing the performance of students who had less than one year of undergraduate research to those with more than one year of research experience. Students wrote reflections and responded to interviews during poster presentations of their research project. BURET asks students to communicate the significance of their project, analyze their experimental design, interpret their data, and propose future research. Scoring rubrics reward students for integrating disciplinary evidence into their narratives. We found that the instruments yielded reliable scores, and the results clarified the impacts of undergraduate research, specifically characterizing the strengths and weaknesses of undergraduate researchers in chemistry at our institution. Students with at least a year of research experience were able to use disciplinary evidence more effectively than those with less than one year of experience. First-year students excelled at explaining the societal relevance of their work, but they incorporated only minimal discussion of prior research into their reflections and presentations. Students at all levels struggled to critique their own experimental design. These results have important implications for undergraduate learning, suggesting areas for faculty members, graduate student research mentors, and CURE or URE programs to improve undergraduate research experiences.
Despite a national call to replace traditional introductory laboratory courses with research-based courses, this practice is still emerging at U.S. colleges and universities (Olson and Riordan, 2012; Laursen, 2019). The course-based undergraduate research experiences (CUREs) that have been developed in chemistry allow students to develop self-confidence and project ownership while contributing to novel research in chemistry (Kerr and Yan, 2016; Ghanem et al., 2018; Cruz et al., 2020). Additionally, students who complete these courses believe that they have learned more chemistry content than they would have in traditional lecture and laboratory courses (Chase et al., 2017). These research-based courses support student interest in chemistry, as students find them to be more enjoyable than “cookbook” laboratories with predetermined project outcomes (Clark et al., 2016; Mutambuki et al., 2019; Muna, 2021). Several studies have explored the benefits of group-based approaches to supporting undergraduates as they learn about and conduct chemistry research (Danowitz et al., 2016; Hauwiller et al., 2019).
Due to their prevalence and potential impact, it is important to assess the effects of science research experiences on student learning, in order to determine how students progress over time and to identify how research experiences can be improved to better serve participants (Auchincloss et al., 2014). Such assessments are relevant to both CUREs and undergraduate research experiences that take place in research laboratories (UREs). Most previous studies that assess learning outcomes of science research experiences are limited to a description of the research experience or self-report data; fewer studies validate self-reports with analysis of research products, direct measures of mastery of scientific content or practice, or observations of student activities (Linn et al., 2015; National Academies of Sciences and Medicine, 2017; Krim et al., 2019; Lin et al., 2019). Scholars such as Pagano et al. (2018) and Stone et al. (2020) have commented on the limited number of studies dedicated to examining the impacts of CUREs in chemistry, when compared to the life sciences. Thus, there is a need for additional assessment tools that can be applied to undergraduate research experiences in chemistry, both inside and outside of the classroom.
Literature on undergraduate research and educational policy documents have identified the following scientific practices as foundational to research experiences for undergraduates (Laursen et al., 2010; Sadler et al., 2010): formulating research questions or hypotheses, designing experiments, analyzing and interpreting data, making conclusions, iteratively planning next steps, and explaining the significance of the research project. Collectively, these scientific reasoning skills are widely regarded as a critical component of science education; educators have moved away from the idea that such skills involve a single cognitive activity, and they are most often viewed as a “set of different but coordinated skills” (Opitz et al., 2017). Thus, the goal of this study was to develop assessment tools to be used with STEM majors, with a particular focus on chemistry, that measure the extent to which they understand research as a set of connected practices. A specific aim for this work was to focus on assessing students’ understanding of scientific practices in the context of their own research project, rather than investigating their ability to answer questions about a hypothetical scenario. In this study, we address the following research questions:
(1) Do the tools we developed distinguish between undergraduate students with different levels of prior research experience?
(2) What do our tools tell us about what students understand about research and what they are still learning at different stages of their undergraduate careers?
The process of conducting research generates knowledge in a way that parallels the knowledge integration (KI) framework (Linn et al., 2015). Activities such as predicting and hypothesizing elicit undergraduate students’ initial ideas. Undergraduate researchers then begin discovering new ideas over time as they gather data and participate in other research practices (Linn and Eylon, 2011; White and Gunstone, 2014). They gradually learn to distinguish between possible interpretations for their data, and reflecting on their research enables learners to consolidate knowledge and generate new ideas for future work (Brown et al., 1989; Linn and Eylon, 2011). KI guides our expectation that as undergraduates progress in research, they will become more proficient in understanding and discussing their research project, linking their insights to relevant discipline-specific content knowledge to form coherent arguments.
Some instruments are “authentic assessments,” which are meaningful opportunities for students to integrate and apply their knowledge to novel, complex, and/or realistic situations that simulate typical activities of scientists (Wiggins, 1998; Doğan and Kaya, 2009; Laungani et al., 2018). For example, the Experimental Design Ability Test (EDAT) gives students a real-world scenario and research question and tasks them with designing an appropriate experiment, and has been used in chemistry and the life sciences (Sirum and Humburg, 2011; Goodey and Talgar, 2016). Several studies suggest that writing activities can support student understanding of chemistry concepts (e.g., Lewis dot structure model) and methods (e.g., spectroscopy), as well as confidence in communicating about the material (Shultz and Gere, 2015; Moon et al., 2018; Watts et al., 2020). The Rubric for Science Writing and the Tool to assess Interrelated Experimental Design (TIED) are two assessment tools designed for use in undergraduate science courses, which involve students in activities that scientists engage in (Timmerman et al., 2011; Killpack and Fulmer, 2018).
There is compelling evidence to suggest that participation in a CURE leads to significant gains in research skills and academic outcomes and can support the subsequent advancement to (and success in) a URE (Rodenbusch et al., 2016; Krim et al., 2019). Studies that measure student learning gains typically consider these gains only over the course of a semester-long research experience, though there is evidence to suggest that undergraduates need to participate in high-impact research experiences spanning more than one semester to develop their understanding of the research process (Deane et al., 2014; Corwin et al., 2015; Griffeth et al., 2015; Harsh, 2016; Remich et al., 2016; Hernandez et al., 2018). A longitudinal study by Szteinberg and Weaver (2013) suggests that CURE students retain chemistry content knowledge longer, as compared to students in traditional laboratory courses.
In order to become independent researchers, undergraduates are also expected to develop an understanding of experimental design (Sirum and Humburg, 2011; Killpack and Fulmer, 2018). Undergraduates are typically presented with narratives about previously completed experiments as part of their STEM coursework, but training in designing experiments is less common (Gormally et al., 2012). When reading scientific papers, undergraduates commonly struggle with evaluating and critiquing the design elements used in the studies being discussed (Varela et al., 2005; Coil et al., 2010). Guided-inquiry laboratories, CUREs, and UREs, in which students design their own experiments, can be used to support experimental design skills in chemistry (Goodey and Talgar, 2016). Multiple studies make the case that instruments are needed to measure experimental design and other skills critical for the development of students as scientists as they prepare to advance in their professional career (e.g., Sirum and Humburg, 2011; Dasgupta et al., 2014, 2016; Danczak et al., 2020).
Science research experiences often require that students contribute to data interpretation, but many undergraduates enter introductory-level STEM courses with insufficient skill in understanding how to work with data (e.g., reading graphs, analyzing and interpreting data, creating data visualizations), and STEM coursework does not necessarily cover this content (Coil et al., 2010; Maltese et al., 2015). Comparing the data analysis skills of various researchers showed that novices are more heavily reliant on personal beliefs, while those with more expertise focus on empirical consistency to draw conclusions from their observations (Hogan and Maglienti, 2001). Chemical education studies suggest that students need to be taught explicitly how to generate and interpret the kinds of visualizations they will need for a particular project, and instruction should be intentional about connecting data to relevant concepts and addressing misconceptions (Connor et al., 2019; Rodriguez et al., 2019). Relatively few studies in chemistry have focused on assessing student skill level in this area, though there is consensus that data interpretation is critical for developing chemists (Maltese et al., 2015; Peteroy-Kelly et al., 2017).
Undergraduate students should also be able to develop hypotheses and conduct appropriate experiments to test these hypotheses by the time they graduate with a STEM bachelor's degree (White et al., 2013). When students are provided with the space to encounter challenges, revise their research goals, and repeat their work, this iterative process can have a powerful impact on their sense of ownership as they learn to navigate obstacles in their scientific discipline (Corwin et al., 2018; Gin et al., 2018). CUREs focused on chemistry have been shown to improve students’ project ownership in lower-division, upper-division, and large-enrollment undergraduate courses (Williams and Reddish, 2018; Cruz et al., 2020; Heller et al., 2020).
(I) Communicate the significance of their specific project to the overarching research questions of the laboratory and the broader scientific field.
(II) Justify their experimental design as appropriate for their research question.
(III) Analyze and interpret data in order to construct explanations and models that are relevant to their research question.
(IV) Generate hypotheses and plan future experiments relevant to their research question in response to their analysis and interpretation of data.
These Indicators provide the focus for the new instruments described in this study. We designed an interview protocol and reflective prompts to assess how undergraduates develop an integrated knowledge of these dimensions as they engage in research. The first is the Reflection instrument (BURET-R), which prompts written student reflections about the progress of their research project. The second is the Poster interview (BURET-P), which is administered at capstone poster sessions. Both tools were administered to students in a variety of research settings.
Group | n | Response rate (%) | Type | Duration | Prior research experience | Student description |
---|---|---|---|---|---|---|
1 | 35 | 58 | CURE | Semester | Mostly none | Freshman chemistry students |
2 | 6 | 90 | CURE | Summer | None | New chemistry transfer students |
3 | 28 | 59 | URE | Ongoing | Variable | Department of chemistry students |
4 | 5 | 65 | CURE | Semester | None | Pre-service STEM teachers |
5 | 15 | 88 | URE | Summer+ | Variable | Pre-service STEM teachers |
Students in Groups 1 and 4 (Table 1) were enrolled in CUREs in which the student was responsible for developing their own research question and choosing the methods used to investigate that question. Students in Group 2 chose from possible research projects that could be investigated using computational chemistry approaches. Students in Groups 3 and 5 had typical apprentice-style research experiences in faculty labs, where the projects varied but fit into the overarching goals of their faculty advisor and were generally related to the projects of their graduate student mentor. The level of independence in designing their own work varied as well, generally according to the amount of time each undergraduate had spent working in their research group.
The students participating in the study ranged from having zero to four or more semesters of research experience prior to study participation. The study population of 89 undergraduates contained a mixture of identities, including gender, race, ethnicity, and first language. Our study participants were 58% female, and 24% stated that English is not their first language. Students who self-identified as American Indian/Alaska Native, Black/African-American, Hispanic/Latinx, or other Pacific Islander, collectively referred to as underrepresented minorities (URM), were intentionally oversampled and comprised 19% of our study population.
A pair of reflective prompts (BURET-R, Fig. 2) was developed to complement the poster presentation assessment. These prompts can be administered at different points in the research experience to provide information on students’ developing progress on the Indicators. In this study, BURET-R was administered a few weeks prior to the students’ poster sessions. The two prompts targeted Indicators III and IV, respectively, but many students also incorporated discussions of Indicators I and II in their responses.
Item | BURET-R | BURET-P |
---|---|---|
Placing their work in a broader context | X | |
Placing their work in a broader scientific context | | X |
Placing their work in a broader societal context | | X |
Providing rationale for an experimental design choice | X | X |
Addressing limitations of an experimental design choice | | X |
Comparing alternatives to an experimental design choice | | X |
Number of experimental design choices with some rationale | | X |
Identifying and discussing the key variables | X | |
Describing their data analysis procedures OR Interpreting their data | X | |
Interpreting their data | | X |
Analyzing sources of error and uncertainty | | X |
Proposing next steps for the project | X | X |
Incorporating references to previous work | | X |
Integrating additional content knowledge | X | X |
Although many of the research projects assessed during this study involved experiments, this was not universally the case. To account for other types of research, “experimental design choice” was defined as any approaches, strategies, techniques, or other decisions made during the study design process. This definition was sufficiently broad to encompass the work described by all students who participated in this study.
To develop a KI rubric for each item, we were guided by prior KI rubrics for similar items. The KI rubrics score items on a 0–5 scale. Each score represents a progressively more integrated and connected response. Applying the KI framework to BURET scoring, descriptions were written for each possible score on each item (see Appendix A for the complete rubric). These were anchored by the idea that a 2 should be a correct statement about an isolated part of the research process, and a 4 should be a clear, basic link from a relevant part of the research process to evidence from scientific content and research practices. For example, a score of 2 on “addressing the limitations of an experimental design choice” could be obtained by simply noting a drawback for a particular technique, whereas a score of 4 would require that students explain that limitation by integrating underlying scientific principles into their discussion or by making a clear reference to the research question. The remaining levels were defined as follows: 0 indicates that responses relevant to the item are absent, 1 indicates a vague statement, 3 indicates a partial link between an assertion and relevant scientific content or practices, and 5 indicates a complex link of 3 or more isolated concepts. The highest level descriptions were informed in part by the graduate students who responded to the BURET-R and BURET-P assessments as scoring categories were being refined. A partial scoring rubric with example responses can be found in Appendix B.
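For readers who wish to operationalize these generic anchors when coding responses, the following is a minimal illustrative encoding; the dictionary and function names are ours, not part of BURET, and the item-specific rubrics in Appendices A and B remain the authoritative scoring definitions.

```python
# Illustrative only: generic KI score anchors paraphrased from the text above.
# The item-specific rubrics in Appendices A and B are the authoritative definitions.
KI_ANCHORS = {
    0: "No response relevant to the item",
    1: "Vague statement",
    2: "Correct statement about an isolated part of the research process",
    3: "Partial link between an assertion and relevant scientific content or practices",
    4: "Clear, basic link from the research process to disciplinary evidence",
    5: "Complex link connecting three or more concepts",
}

def describe_score(score: int) -> str:
    """Return the generic anchor description for a 0-5 KI rubric score."""
    if score not in KI_ANCHORS:
        raise ValueError("KI rubric scores range from 0 to 5")
    return KI_ANCHORS[score]
```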
Additionally, students presenting at one of the two target URE poster sessions (Groups 3 and 5 in Table 1) were invited by email to participate in this study. Responses to BURET-R were collected via Qualtrics a few weeks prior to the poster sessions, and all consenting students who provided responses to BURET-R were interviewed at their poster session, using the same protocol that was used with the CURE students. From our full dataset, 80 BURET-R responses and 55 BURET-P interviews were found to be complete and fully legible or audible, and these were used in our subsequent analysis.
Item-response theory (IRT) analysis was conducted to gather validity evidence based on internal structure at the instrument level. Because our sample size was not sufficient to run the analysis using all thresholds from our rubric, data were collapsed into scores of low (0–2), moderate (3), or high (4–5), and Wright maps for each instrument were generated from the collapsed data. Additionally, exploratory factor analysis was performed and item-test correlations were calculated to determine whether the construct we are measuring is uni- or multi-dimensional. All statistical analysis was conducted in Stata, except for the IRT analysis, which was performed in ConQuest.
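As an illustration only, the score collapsing and the classical reliability and item-test statistics described here can be reproduced along the following lines; this is a Python sketch with simulated scores and our own variable names, not the Stata/ConQuest code used in the study.

```python
# Sketch (simulated data): collapse 0-5 rubric scores into low/moderate/high bands
# and compute Cronbach's alpha plus item-rest correlations for a score matrix.
import numpy as np
import pandas as pd

def collapse(score: int) -> int:
    """Map a 0-5 rubric score to 0 (low, 0-2), 1 (moderate, 3), or 2 (high, 4-5)."""
    return 0 if score <= 2 else (1 if score == 3 else 2)

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical respondents-by-items matrix of 0-5 rubric scores (80 students, 6 items).
scores = pd.DataFrame(np.random.randint(0, 6, size=(80, 6)),
                      columns=[f"item_{i}" for i in range(1, 7)])

# The collapsed categories are what would be exported for the IRT analysis.
collapsed = scores.apply(lambda col: col.map(collapse))

print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
for col in scores.columns:
    rest = scores.drop(columns=col).sum(axis=1)  # total of the remaining items
    print(col, "item-rest correlation:", round(scores[col].corr(rest), 2))
```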
Semesters of previous research experience | 0–1 | 2+ | Sig.
---|---|---|---
Sample size (n) | 42 | 38 |
Placing work in a broader context | 2.9 | 3.6 | *
Providing rationale for expt. design choice | 1.9 | 2.8 | **
Identifying and discussing the key variables | 2.0 | 2.4 |
Describing OR interpreting data analysis | 2.6 | 3.5 | **
Proposing next steps for the project | 2.6 | 3.2 |
Integrating additional content knowledge | 0.8 | 2.4 | **
Average score | 2.1 | 3.0 | ***
***p < 0.001; **p < 0.01; *p < 0.05.
Semesters of previous research experience | 0–1 | 2+ | Sig.
---|---|---|---
Sample size (n) | 24 | 31 |
Placing work in broader scientific context | 2.3 | 3.5 | *
Placing work in broader societal context | 3.6 | 3.7 |
Providing rationale for expt. design choice | 3.5 | 3.9 |
Addressing limitations of expt. design choice | 2.8 | 3.3 |
Comparing alternatives to expt. design choice | 2.7 | 3.4 |
Expt. design choices with some rationale (max. 5) | 2.5 | 3.3 | *
Interpreting their data | 3.1 | 3.5 | *
Analyzing sources of error and uncertainty | 2.3 | 2.5 |
Proposing next steps for the project | 3.1 | 3.2 |
Incorporating references to previous work | 1.9 | 2.9 | *
Integrating additional content knowledge | 2.3 | 3.5 | *
Average score | 2.7 | 3.3 | ***
***p < 0.001; **p < 0.01; *p < 0.05.
As a measure of reliability, Cronbach's alpha is calculated to be 0.78 for both instruments, which is in the range considered acceptable for science education research instruments (Taber, 2018). Further psychometric analysis suggests an acceptable consistency of the items to measure respondent performance and provides evidence that a unidimensional construct is being measured (see Appendix D for more information).
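For reference, the reliability coefficient reported above follows the standard definition of Cronbach's alpha for an instrument with k items, where $\sigma_i^2$ is the variance of item i and $\sigma_X^2$ is the variance of the total score:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)$$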
Two other variables that are highly correlated with increased research experience are year in school and whether the research experience was part of a course. As previously mentioned, most of the novice researchers in our sample were enrolled in a CURE, while most of the advanced researchers were participating in a URE in a faculty lab and had previously completed a CURE. To determine which of these variables was the best predictor of total score on our instruments, factorial ANOVAs were run using year in school, semesters of research experience, and URE/CURE as the independent variables. For the BURET-R, only URE/CURE was a significant predictor (p < 0.05), whereas neither year in school nor semesters of research experience was. For the BURET-P, only semesters of research experience was a significant predictor (p < 0.05). On this instrument, the type of research experience did not have a significant effect on student performance, and the URE students with minimal total research experience performed similarly to other novice researchers. No interaction terms were significant for either instrument. We were unable to examine whether there were differential effects for students who identified as a URM because we recruited too few of these students with at least 2 semesters of research experience. Future work will be needed to investigate this aspect of our instrument.
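As an illustration of this type of analysis (hypothetical data and variable names, not the analysis code used in the study), a factorial ANOVA with these three independent variables can be specified as follows:

```python
# Sketch (hypothetical data): factorial ANOVA predicting total BURET score from
# year in school, semesters of research experience, and experience type (URE/CURE).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "total_score": [2.1, 3.0, 2.6, 3.4, 1.9, 3.2, 2.8, 2.4, 3.1, 2.2],
    "year":        ["1", "4", "2", "4", "1", "3", "2", "3", "2", "4"],
    "semesters":   ["0-1", "2+", "0-1", "2+", "0-1", "2+", "0-1", "0-1", "2+", "2+"],
    "exp_type":    ["CURE", "URE", "CURE", "URE", "CURE", "URE", "URE", "CURE", "URE", "CURE"],
})

# Main effects only; interactions can be tested by replacing '+' with '*'.
model = ols("total_score ~ C(year) + C(semesters) + C(exp_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # Type II sums of squares
```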
Indicator | Items
---|---
I | Placing their work in a broader scientific context
| Placing their work in a broader societal context
| Incorporating references to previous work
| Integrating additional content knowledge
II | Providing rationale for an experimental design choice
| Addressing limitations of an experimental design choice
| Comparing alternatives to an experimental design choice
| Number of experimental design choices with some rationale
III | Interpreting their data
| Analyzing sources of error and uncertainty
IV | Proposing next steps for the project
a The items are grouped by Indicator for clarity of interpretation. Note that there is no evidence from the internal structure of the instruments for the items to be grouped in this way.
Advanced students often demonstrated a more integrated understanding of scientific context by explaining the current state of the field or how their research might affect projects in other labs. One chemistry student who received a score of 4 stated, “I was … working on investigating … the mechanical properties of polycarbonate urethane. Our research is particularly relevant to joint implants and joint replacements, … the current industry standard polymer is called ultra high molecular wave polyethylene…. Polycarbonate urethane or PCU is being pioneered as a new material… But it's pretty new so we're still doing research on the very mechanical properties and how it will react to being in the body and in an ionic environment where there is salts and stuff like that, that can affect its microstructure.” In this response, the student clearly connects their work on the mechanical properties of PCU to the broader field of material science, particularly in the area of artificial joints.
A student would receive a 2 on the “Incorporating references to previous work” item by clearly referring to previous research but failing to explicitly link that research to their experimental design or compare it to their own results. For example, a student was scored 2 for the following vague reference to previous work, ‘A lot of it was help from literature that we've seen online, especially the solvents. I wouldn't have known where to start without using some of these.’ As an example of a higher scoring discussion, one student stated that, “There'd been, not a consensus, but almost every single study that we had read previously looking for these heavy metals in chocolate, but also in other candy, had focused on the cocoa, then being the source and maybe mentioned other possible sources in passing.” The student then compared this body of previous work with their own work, which found a possible alternate source of heavy metals, resulting in a score of 4.
Additionally, advanced students scored higher on providing context by integrating more additional content knowledge into their presentation and answers. Additional content knowledge was defined as “exhibiting scientific content knowledge beyond what is required to describe the project.” Students received a 2 by simply providing some additional clarification, or a 4 by providing multiple examples or extensive discussions of relevant information. It should be noted that this does not directly measure the content knowledge of a student, but rather the extent to which students have integrated that content knowledge into discussions of their research.
Examples varied broadly, from why the research group chose to study a certain topic to the specific instruments used to collect raw data. In the BURET-P interview protocol, students were asked why they made a given design choice instead of something else, and they were also asked about the limitations of that choice. In general, both novice and advanced students scored highly on providing a rationale for an experimental design choice related to their project for the BURET-P instrument; over half of the students scored 4 or higher, which requires a clear description of the design choice and an explicit rationale that integrates domain-specific content knowledge. For example, “We chose to use micro plasma atomic emissions spectroscopy because of its wide dynamic range. While there were many other instruments that would have worked similarly well, but not within this large range. And we were very uncertain as to whether we were over diluting or under diluting our samples… We only had rough EPA guidelines to kind of guide our choices.” The marginally higher average score for advanced students compared to novices was not significant. However, advanced students did explain a greater number of their decisions than novices. To reflect this, an item was included that simply counted the number of design choices for which the student provided some rationale. This number was significantly higher (p < 0.05) for advanced students, reflecting the greater detail in which they described their experimental design.
Students were less proficient at discussing the limitations of experimental design choices. A representative response is, “So if the standards aren't prepared correctly or if they're too high on concentration, it may negatively, it definitely will negatively affect our data. So I think that's a big limitation.” which received a 2 for only identifying user error as a possible limitation. However, some students were able to discuss limitations more fluently; for example, the following excerpt scored a 4: “The limitations of that technique are that bringing it under PBS, which is phosphate buffered saline, only mimics the ionic concentrations. It doesn't mimic the chemical function. So [what] we'd like to do for further research is hydrate it in [inaudible], which … mimics in vivo synovial fluid.” Both novice and advanced students showed moderate levels of sophistication on the “comparing alternatives” item but rarely scored as high as 4, for which they needed to make a clear comparison between their choice and the alternative, explaining why their choice was superior. For example, “We decided to use MPAS instead of graphite furnace atomic absorption spectroscopy, even though both measure lead very well. Because MPAS has a larger dynamic range, and we were very uncertain as to the concentration we were gonna get.”
For example, “And we found that with lower concentrations of silver, we get the same amount of silver conductivity” scored a 2 because there was a clear statement about the experimental results but no additional comments were made about the data or their conclusions. A score of 4 required students to explain what they observed: “We stained the plates, which contained the cellulose media, with Congo red, which is a dye that binds to cellulase. So what that allowed us to do is once we washed the excess dye away, we got results that looked like this: the bacterial colonies that didn't produce any cellulase show no halo, and the whole plate is red, because the cellulose is still there, the dye binds, it's all still there. The ones that you see here have a halo of white, are positive results. They produce cellulase, and we know that because around the bacterial colony, is a halo where the cellulose has been degraded, and the dye doesn't bind.” This chemical biology student describes the underlying mechanism of the assay, explaining what is happening on a molecular and cellular level to justify their interpretation.
While scores for data interpretation were generally relatively high, students performed less well on analyzing sources of error and uncertainty. Most students identified a clear potential source of error or expressed skepticism about their results, but less than half of the students elaborated on their answer or connected that source of error to either their experimental design or their conclusions. A more complete response might explain how the experiment was designed to control for possible sources of error. For example, “And also, to avoid error we wanted to use NMR. First, we dissolve our wristbands using deuterated chloroform, and then running that through NMR, and seeing if there are any errors that we can possibly encounter for contamination. We just wanted to make sure the wristbands were mostly silicon. We had a positive control and negative control in just the chemical that we tested.” However, most students did not discuss sources of error at this level and there were no significant differences between novice and advanced students.
Students who received a 2 on this item typically suggested “more”-based continuations of their work with no rationale: more trials, more substrates, more different temperatures, and so on. In contrast, students who received a 4 would include a rationale that integrates domain-specific content knowledge; for example, one chemistry student said, “In the future we hope to perform confocal microscopy to determine the depth of infiltration, that's another common problem with current scaffolds is that they'll grow in an x–y plane and spread out in a nice flat layer, but they don't go into the bi-layer membrane. So that's what we're hoping to get with these fiber mats later, when you spin onto a mesh collector plate you get these really nice nodes, and we're hoping that cells could easily fit into those pores and infiltrate deeper into the membrane.” Most students fell in between these two points; over 50% of participants scored a 3 on this item.
In contrast, students at all levels performed well on providing an integrated societal context for their work, and more advanced students did not receive higher average scores on this item. The ability to discuss the broader impacts of a research project is a valued skill, with some institutions offering courses explicitly aimed at training students in this area (MacFadden, 2009; Heath et al., 2014). In two of the CUREs included in this study, students developed research questions, often addressing a societal issue of interest to them, and as a result, they could fluently discuss the societal relevance of their project. Because novice students were strong on this item, there was little growth with more research experience.
The general trends we observed for experimental design also hold for data interpretation; we showed that students generally performed well on giving straightforward interpretations of their data but were less likely to provide a richer description unless specifically prompted. Scores on the combined data analysis and interpretation items on both instruments were relatively high, with advanced students scoring significantly higher than novice students. This is consistent with other studies showing that data interpretation skills correlate with increased research experience (White et al., 2011; Harsh et al., 2017). In contrast, one of the lowest scoring items for both novice and advanced students was their ability to identify and discuss potential sources of error in their work. Students may deliberately focus on more positive aspects of their project, or the low scores may reveal a genuine deficit among undergraduates, who have been shown to struggle with critically analyzing experimental designs, generating data visualizations, and interpreting chemical data (Varela et al., 2005; White et al., 2011). Our study suggests that students may benefit from targeted interventions in these areas throughout their undergraduate career.
A concept from the literature that is closely related to the item on proposing future work is that of iteration, as students scored higher when the proposed work was linked in some way to their most recent results. Authentic research is an iterative process, where the data from one experiment helps inform the next. Some have suggested that iteration is an essential part of an undergraduate research experience (e.g., Auchincloss et al., 2014), and efforts have been made to explicitly include iteration in CUREs (Light et al., 2019). Although there are instruments that measure whether a student perceives iteration to be a part of their research experience (Corwin, Graham, et al., 2015), to our knowledge, there are no instruments that assess student proficiency in proposing next steps for an ongoing research project.
We anticipated that advanced students would be more experienced at proposing future experiments and would therefore be able to discuss them more fluently in their written responses and poster presentations. Although this was not reflected in the average scores, we observed that only advanced students received the highest possible score for proposing next steps on either the BURET-R or BURET-P instrument. Additionally, most of the advanced graduate students who were interviewed during the development of the instrument (see Methods) scored at the highest level on the BURET-P for this item.
One potential explanation for the discrepancy between expectations and observed results for undergraduate researchers on average is that many of the advanced undergraduate presenters were weeks away from graduation. Those students were likely in the process of concluding their research and were not planning longer-term directions of the project. As a result, their scores on proposing future work might be lower than if we had interviewed them earlier. In contrast, many novice students were enrolled in a one-semester CURE in which they were explicitly instructed to talk about future work as part of their poster presentation. Their relative success in this area suggests that, contrary to faculty expectations, even novice students can be expected to propose the next steps of their research project, and this expectation should be more explicitly integrated into UREs.
Chemistry sub-discipline | Excerpt from student poster presentation |
---|---|
Biochemistry | My interpretation of these results is that the R-pal is utilizing the thiosulfate to grow and produce ammonia, so that's the main takeaway of this experiment and that if we took out thiosulfate and replaced it with another electron donor then they would grow with those electrons donated from that |
Inorganic chemistry | What I’ve done here is I’ve synthesized a magnet that targets the lanthanide that has a strongly axial crystal field, but also a radical bridge, and this works very well because the 2,2′-bipyrimidine, that is substituted with chlorines, is a very weak epineural donor and so the crystal field becomes more axial because you have such a weak epineural donor even though you still have a radical lanthanide bridge |
Materials chemistry | But the decrease is that prevention of growth that I was talking about, [due to] the charge neutralization of the bromide ions on the ends of the surfactant. So, if the surfactant is more packed, no more gaps are available for precipitation to occur, and so you can’t grow any nanorods per se. All you’re gonna be left with is a bunch of spherical nanoparticles, no growth curve. So that's the reason for this decrease |
Physical chemistry | We first ran them on the mass spec to know that we know, that it's working as a control. So we can tell there's one peak for the full rotaxane, and then over time we can see the cleaved product come off, and that peak grows in over time. So then after 16 hours it works. So we go to do it with xenon NMR we can see the same thing after 16 hours you have a pretty full peak come in for CB-6. Here, this is the water peak for CB-6, we always see that in the xenon NMR experiment, and you see CB-6 peak, that's the xenon going in and out there |
Atmospheric chemistry | What is shown here is the VOC reactivity to show it's relatively constant, and then the NOx concentrations, and the ozone concentrations. So the NOx decreases from weekday to weekend because there are less giant trucks driving. Then this is showing that ozone decreases, but it doesn’t really decrease that much, it's basically the same |
Self-selection bias is a limitation of undergraduate research studies, because those who participate are likely to be among the most highly motivated and high performing students. We expect selection bias to be minimal in our case, as approximately 70% of chemistry majors, who make up the majority of our sample, participate in undergraduate research, giving them an opportunity to participate in the poster session from which we recruited our participants.
Moreover, the BURET instruments provide an informal, low-stakes method for mentors to check on the progression of their students. Research mentors can regularly observe students setting up and analyzing the results of experiments, but they often have fewer opportunities to probe how their undergraduate students think about the research project more broadly. The BURET-R can be used as a way to quickly gauge how the student discusses their project in response to open-ended questions. These responses can guide research mentors in initiating conversations that strengthen the student's understanding of the project and help the student turn what they know into an integrated narrative about their work. Additionally, the act of responding to the BURET-R prompts is itself a useful opportunity for the student to reflect on their project, which may not be a regular feature of their research experience. Similarly, answering the BURET-P protocol questions is an inherently useful activity, as it can help students to strengthen their poster talks and provide practice taking questions from the audience.
At a departmental or institutional level, the BURET instruments can be used at regular intervals to assess how well a particular research experience is supporting student learning as they progress from novice to advanced researchers. The BURET instruments complement self-report survey data by enabling educational researchers to directly measure student learning with respect to knowledge and skills that are critical for their development as scientists. In the event that certain BURET Indicators are of greater importance with a particular student group, specific probes, like the interview questions in the BURET-P instrument, can be used to further explore student thinking for different components of the research process. Both of the BURET instruments can be used to provide students with feedback about their strengths and knowledge gaps with respect to the research project they are working on in a CURE or URE. These instruments can also be used to compare different research experiences, providing individual CUREs or UREs with information about the areas in which students need additional instruction or training from their research mentors.
Score | BURET-R | BURET-P – scientific | BURET-P – societal |
---|---|---|---|
0 | Does not explain goals of experiment or project | Does not explain goals of project | Does not explain goals of project |
1 | Partial or unclear description of experiment and/or project goals | Partial or unclear description of project goals | Only discusses “personal” goals, but does not mention a societally relevant topic |
2 | States goal of experiment OR states goal of project | Clearly states goal of project OR States a very limited scientific application of their work | Collecting data with no further connection to societal importance OR Reader can infer societal importance or application of the data collected (i.e. mentions a societally-relevant topic like semiconductor or cancer) |
3 | Clearly states goal of experiment AND States goal of project (vagueness allowed) | States a general area of science that their work contributes to OR Vague or implied version of below | Implies societal importance OR Vague statement about the possible benefits or use of results |
4 | Partial Link (3) AND (explains how expt advances larger project OR explicit link of project to broader significance (scientific or societal)) | Discusses how future projects (by other labs) might be affected by current project OR Suggests new research paths or projects that could be based on this work OR Provides sufficient background for reader to understand current state of field | Explicitly connects project to specific societal need OR Explicit statement about the possible benefits or use of results (Accurate content knowledge and coherent argument should be present. However, exact mechanism of connection does not need to be stated.) |
5 | Partial link (3) AND explains how expt advances larger project or the portion of the project they are working on AND explicit link of project to broader significance (scientific or societal) | Basic Link (4) with two out of three of the criteria present OR Two different scientific contexts explained for the project - both at Basic Links (4) | Explicit comparison between current project goals and existing solutions to those problems. Exact mechanism of connection does need to be stated. OR Explicit and specific statement about the possible benefits or use of results, including statement of existing societal issue or need |
Score | BURET-R & P – rationale | BURET-P – limitation | BURET-P – comparison |
---|---|---|---|
0 | Coder cannot identify any design choice discussed | Does not discuss any limitations of design choice | Does not mention any alternative design choices |
1 | Partial or unclear description of one design choice | Vague statement of a very generic limitation or logistical issue OR Vague or implied description of a thoughtful limitation – implied in the description of results | Mentions the fact that there are alternatives, but doesn’t mention what these are |
2 | Clear description of one design choice, but rationale is poor or absent | Clear statement of a very generic limitation or logistical issue OR Vague description of a thoughtful limitation | Mentions specific alternative, but no comparison OR Compares to alternative because alternative is, in their opinion “not possible” |
3 | Clear description of one design choice AND gives reasonable (sounding) rationale but vague, implied, or invokes little to no content knowledge | One or more thoughtful limitations mentioned, but content knowledge only implied | Compares design choice to an alternative, but is somewhat vague or implied OR Compares to alternative because alternative is, in their opinion “not possible”, plus why it wouldn’t be possible |
4 | Clear description of one design choice AND gives explicit rationale for choice of instrument or experiment that integrates domain-specific content knowledge | Gives at least one explicit limitation that integrates domain-specific content knowledge | Comparison to an alternative design choice on a single facet with a clear statement of difference or advantage or reason to use one or the other |
5 | Basic Link (4) but multiple distinct reasons for design choice are discussed AND Strong evidence of extensive content knowledge that supports their choices | Basic Link (4) AND (discusses how limitations affect conclusions OR discusses how limitation was addressed, minimized, avoided, etc.) | Clear comparison to an alternative design choice on more than one facet OR 3 or more Basic Links (4) |
Score | BURET-R definition |
---|---|
0 | Does not indicate what type of data is being collected or discuss any other relevant variables |
1 | Isolated Concept but vague or implied (unclear what they are actually measuring, manipulating, comparing) OR Basic instrument verification on a standard |
2 | Clearly identifies what is being measured (raw OR analyzed) OR Clearly identifies one or more variables being manipulated, compared, or held constant |
3 | Isolated Concept (2) AND (Provides basic rationale for choice of variables and/or range being investigated OR Gives details on how or to what extent the variables are manipulated) OR Basic link (4), but rationale or predictions are vague or questionable |
4 | Clearly identifies what is being measured (raw OR analyzed) AND Clearly states one or more variables being manipulated, controlled or compared AND (Provides rationale (clear, but slightly generic okay) for why manipulated variables would affect measurements/output OR Provides reasonable prediction of how manipulated variables will affect output) |
5 | Basic Link (4) AND Rationale and/or predictions are strong and integrate content knowledge |
Score | BURET-R – manipulation | BURET-R and P – interpretation |
---|---|---|
0 | Does not describe any analysis of raw data | Does not describe results OR Has not collected data yet |
1 | States that no data analysis was performed OR States that results are inconclusive with no elaboration | Unclear how conclusion is supported by results OR Implies data interpretation but does not sufficiently describe |
2 | States a procedure for analyzing or manipulating data with no elaboration | Summarizes results without interpretation OR Pre-packaged conclusion OR States an interpretation with no connection to data |
3 | Links raw data to analyzed results, but discussion of data or analysis method/procedure is vague | Summarizes results and links to content knowledge or compares to expectations, but vague or minimal insights |
4 | Clearly links raw data to analyzed results, including (clear) description of the analysis process | Gives plausible explanation for results (or compares results to expectations in a way) that integrates clear content knowledge |
5 | Basic Link (4), plus discusses at least one assumption or consequential decision made during analysis | Basic Link (4), but integrates extensive content knowledge OR Discusses alternate interpretations |
Score | BURET-P definition |
---|---|
0 | Does not identify any potential sources of error |
1 | States that the experiment (or a large part of it) “didn’t work” without any elaboration as to why OR Describes confidence in the ability of methods to answer the RQ |
2 | Identifies a clear “error” in what was done OR Vague reference to limitation of method/technique when discussing confidence in results OR Vague “doubts” about data |
3 | Identifies potential sources of error that are less “obvious” OR Clear reference to limitation of method/technique when discussing confidence in results |
4 | Clearly identifies potential reasonable source(s) of error AND Mentions how these connect to at least one of the following: 1. Research questions; 2. Experimental design (current or future); 3. Their conclusions |
5 | Clearly identifies multiple distinct potential reasonable source(s) of error at the level of a Basic Link (4) |
Score | BURET-R and P definition |
---|---|
0 | Does not discuss any potential future work |
1 | Completely different goals for future work with no/minimal relationship to current work OR Implies that they will “continue with the plan” but does not sufficiently describe |
2 | Simple quantitative extension, modification, or new experiment with no or poor rationale OR “Continue with the plan” OR Repeat experiment with simple issue fixed |
3 | Simple quantitative extension with good rationale OR Modification or new experiment with credible but vague rationale OR Repeat experiment after difficult-to-predict issue fixed (troubleshooting), link to content knowledge is vague or absent |
4 | Modification, troubleshooting, or new experiment with clear rationale that integrates content knowledge |
5 | Multiple Basic Links (4), at least one of which is not a borderline Partial Link (3) OR (Basic Link AND Explicitly links new choices to the results of current work) |
Score | BURET-P definition |
---|---|
0 | Does not mention any prior work |
1 | Vague references to “other studies” without any specific designs/results or clear specification of how this informs part of project |
2 | Clear reference to previous work, but no stated connection to current work OR Vague reference to previous work with connection to current project |
3 | Clear description of previous design or results AND (Vague connection to/influence on current work OR Vague comparison b/w old and new design or results) |
4 | Summarizes previous work (specific design or results) AND (Explicitly states how it connects to/influenced current work OR Compares to current results) |
5 | Basic Link (4) AND (Explanation of how current work is different or novel OR Attempts to interpret sim/diff between current and previous results) |
Score | BURET-R | BURET-P |
---|---|---|
0 | Response does not integrate any scientific content knowledge beyond what is necessary to describe the project | Response does not integrate any scientific content knowledge beyond what is necessary to describe the project |
1 | (Not used) | (Not used) |
2 | Weak example of a Partial Link (3) | Weak example of a Partial Link (3) |
3 | Exhibits scientific content knowledge beyond what is required to describe project | Exhibits scientific content knowledge beyond what is required to describe project |
4 | (Not used) | Exhibits extensive scientific content knowledge beyond what is required to describe project |
5 | Exhibits extensive scientific content knowledge beyond what is required to describe project | Multiple Basic Links (4) |
Score | Description | Examples |
---|---|---|
0 | – Does not discuss any limitations of design choice | |
1 | – Vague reference to limitations | – “Again, part of the main problem is that graphite furnace is really temperamental.” |
2 | – Clear statement of a generic limitation, OR – Vague description of thoughtful limitation | – “In terms of that technique, I think it depends on the accuracy in which the solutions are prepared. So if the standards aren't prepared correctly or if they're too high on concentration, it may negatively, it definitely will negatively affect our data. So I think that's a big limitation. And also you have to produce a lot of different samples, which can be time consuming.” |
3 | – One or more thoughtful limitations mentioned, but content knowledge only implied | – “The limitations of Congo Red is that it is visual. So it is qualitative even though we can't measure the radius. The radius isn't really going to tell us anything numerical about how much cellulose the bacteria digests.” |
4 | – Gives at least one explicit limitation that integrates domain-specific content knowledge | – “One experimental technique that we use is hydrating the sample and then putting them under nanoindentation… So the limitations of that technique are that you’re running it under PBS, which is phosphate buffered saline, and that only mimics the ionic concentrations, it doesn't mimic the chemical functionality you’d encounter in in vivo synovial fluid.” |
5 | – Basic Link (4) AND (– Discusses how limitations affected conclusions OR – Discusses how limitation was addressed, minimized, avoided, etc.) | – “The main limitation is that the scaled particle theory ignores the entropic consideration in the energy of interaction here, so it's hard to say what would happen at different temperatures. In order to predict the temperature dependence, you need an approximate value of the entropy of dissolution, which isn’t known for a lot of these molecules. However, we found that that's actually very easy to predict. For each group of molecules it's approximately constant for a certain chlorination number so you know that if you have a PCB and it has three chlorines that you will know the entropy very well.” |
Additionally, all students presenting at one of the two target URE poster sessions were invited by email to participate in this study. For the chemistry poster session, 112 students were invited to participate and 66 consented, for a response rate of 59%. For the pre-service teacher poster session, 23 of the 26 students (88%) responded affirmatively to the invitation. Responses to the reflective prompts were collected via Qualtrics a few weeks prior to the poster sessions. All consenting students who provided answers to our reflective prompts (30 from the chemistry poster session and 23 from the pre-service teacher session) were interviewed at these poster sessions, using the same protocol. From this pool of URE participants, 6 URM students and a random sample of 24 other students were chosen for further analysis. An additional 15 students for whom we had prompt responses but not poster session interviews were also randomly selected for analysis. In total, the dataset we analyzed included 80 responses to reflective prompts and 55 poster session interviews.
Item-response theory (IRT) analysis was then conducted to establish the internal structure at the instrument level (Wilson, 2005). Because the sample size was not sufficient to run the analysis using all thresholds from the rubrics, data were collapsed into scores of low (0–2), moderate (3), or high (4–5); this collapsing ensured at least one response for each possible answer choice, allowing the data to be fit to an item response model. Wright maps for each instrument were generated from the collapsed data. The resulting Wright maps (see Fig. 3 and 4) show that the range of instrument item logit values spans nearly the entire distribution of respondent logit values, with only a few students falling below all item thresholds on the BURET-R instrument, and a few Thurstonian thresholds located below the lowest respondent logit value for the BURET-P instrument. The reliability of the partial credit model analysis carried out on the data is 0.77 for BURET-R and 0.76 for BURET-P. These values indicate an acceptable consistency of the items to measure respondent performance (Wright and Masters, 1982; Bond and Fox, 2007).
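For background, the partial credit model referenced above (Wright and Masters, 1982) gives the probability that respondent n with ability $\theta_n$ obtains score x on item i with step difficulties $\delta_{i1},\ldots,\delta_{im_i}$ as

$$P(X_{ni}=x)=\frac{\exp\!\left(\sum_{k=0}^{x}(\theta_n-\delta_{ik})\right)}{\sum_{h=0}^{m_i}\exp\!\left(\sum_{k=0}^{h}(\theta_n-\delta_{ik})\right)},\qquad x=0,1,\ldots,m_i,$$

with the convention that the k = 0 term in each sum is zero; the item and person parameters plotted on the Wright maps are estimated on this common logit scale.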