Andrew Apugliese and Scott E. Lewis*
Department of Chemistry, Center for the Improvement of Teaching & Research in Undergraduate STEM Education, University of South Florida, Tampa, FL 33620, USA. E-mail: slewis@usf.edu
First published on 5th December 2016
Meta-analysis can provide a robust description of the impact of educational reforms and also offer an opportunity to explore the conditions where such reforms are more or less effective. This article describes a meta-analysis on the impact of cooperative learning on students’ chemistry understanding. Modifiers in the meta-analysis are purposefully chosen to model instructors’ decisions in implementing cooperative learning. Modifiers investigated include: using cooperative learning periodically or in every class period; setting a maximum group size at four or smaller versus five or larger; using closed-ended or open-ended assessments; and assessing a single topic or assessing the cumulative topics in the course. The results showed that cooperative learning's effectiveness is robust across a wide range of instructional decisions, except that no evidence of effectiveness was found with cumulative assessments. The overall results from the meta-analysis provide a benchmark for future evaluations of pedagogical interventions in chemistry.
Specific to science, technology, engineering and math (STEM) education, Bowen conducted a meta-analysis of 37 research studies, published between 1980 and 1996, on the effectiveness of CL in undergraduate STEM courses (Bowen, 2000). Across the 37 studies, 49 effect sizes were calculated and averaged to find an effect size of 0.51. Treating multiple effect sizes from a single study (e.g. when data from multiple tests were reported) as separate data points gives those studies greater weight relative to studies with only one reported effect size. The decision to treat each effect size as a separate data point can be problematic, as studies with multiple effect sizes still represent only one independent sample.
Recently, Warfa conducted a meta-analysis on the impact of CL in chemistry classes (Warfa, 2016). The analysis began by reviewing the literature from 2001 through 2015 using the search terms “cooperative learning” and “chemistry” and either “treatment group” or “control group”. The resulting articles were screened for the following characteristics: the study was authored in English, occurred within a face-to-face chemistry classroom setting, had an outcome measure of chemistry achievement, used an experimental or quasi-experimental research design that compared CL to a control pedagogy, and provided sufficient statistical information to enable analysis (Warfa, 2016). These criteria identified 25 articles from the research literature. Multiple data points from one sample within a study were combined into a single effect size. One study (Acar and Tarhan, 2008) was removed from the analysis as an outlier, and another study contributed two data points as it reported two independent samples, one composed of General Chemistry students and another of Organic Chemistry students (Chase et al., 2013).
Each of the effect sizes of the 25 independent samples was converted to a Hedges's g value to remove a positive bias associated with Cohen's d when sample sizes are small (Lipsey and Wilson, 2001). Next, each effect size was weighted using a random effects model, and the average weighted effect size reported was a g value of 0.68. Warfa also examined the data for evidence of publication bias and the presence of moderator effects. For publication bias, the funnel plot method suggested a bias in favor of small sample sizes. Follow-up trim-and-fill investigations suggested the bias would not appreciably change the reported results. For moderators, Warfa explored the role of class size and geographic location on effectiveness. The results suggest that CL was most effective in non-US-based locations, though this could be attributed to grade level, as non-US-based studies were disproportionately at the high school level. The results also suggested that CL was most effective with small class sizes (fewer than 50 students), though the author cautions that too few studies with medium and large class sizes were present to permit a definitive conclusion.
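As a point of reference for this conversion, the sketch below shows the standard small-sample correction that turns Cohen's d into Hedges's g (Lipsey and Wilson, 2001); it is a generic illustration rather than Warfa's code, and the function name and group sizes in the example are placeholders.

```python
# A minimal sketch of the small-sample correction for Cohen's d; the function
# name and the example group sizes are illustrative, not drawn from the studies.
def hedges_g(cohens_d: float, n_treat: int, n_control: int) -> float:
    """Convert Cohen's d to Hedges's g using the approximate bias correction."""
    df = n_treat + n_control - 2
    correction = 1 - 3 / (4 * df - 1)  # correction factor shrinks d slightly
    return correction * cohens_d

# Example: d = 0.70 from two groups of 20 students gives g of roughly 0.69
print(hedges_g(0.70, 20, 20))
```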
Warfa's analysis shows that, overall, CL is more effective than traditional instruction, and the moderators chosen indicate that the geographic location and grade level where CL is used are related to the observed effectiveness. Warfa's analysis also revealed that other modifiers could potentially explain the heterogeneity observed among the articles analyzed. This study seeks to explore the role of additional potential modifiers within Warfa's data set with the intent of further defining where CL is effective. In particular, modifiers are identified that relate to instructor decisions regarding the implementation and assessment of CL. Examples of instructor decisions are: the type of assessments used, the extent to which cooperative learning is incorporated into the class and the maximum group size permitted. The focus on instructor decisions is purposeful in that it can inform instructional practices regarding either how to enact cooperative learning or under which classroom conditions cooperative learning would have an expected benefit. For a hypothetical example, if CL is found not to be effective with closed-ended exams, then instructors who rely entirely on closed-ended exams may elect either to not employ CL or to change the nature of their exams.
Meta-analysis has also been described as a means for establishing effect size benchmarks that are more directed toward a particular area of study. As Lipsey et al. (2012) argue, a meta-analysis can provide a better description than Cohen's benchmarks of small, medium and large effect sizes, as a meta-analysis presents empirical evidence of the effect sizes found in the relevant research literature. Thus a meta-analysis on the use of cooperative learning to impact students’ chemistry achievement can provide a description of expected effect sizes for pedagogical interventions in chemistry. Finally, meta-analyses provide an overview of a considerable body of research literature for instructors and researchers; such overviews are becoming more necessary given the recent increase in research article production in chemistry education (Ye et al., 2015).
(1) Under which conditions, as determined by instructor discretion, has CL been found to be effective for chemistry instruction?
(2) What is the range of effect sizes observed for CL in chemistry that can establish small, medium and large effect sizes for pedagogical interventions in chemistry?
Thus each assessment was categorized based on the above assessment constructs and each sample was categorized based on the CL usage construct and group size construct. In cases where insufficient information was available to categorize on one of the above constructs, an effort was made to contact the corresponding author of the article for additional information. Ultimately, if the information could not be obtained, the sample in the study was counted as missing toward that construct and not included in either of the two categories associated with the construct. The categorization of each article is presented in Table 1.
The observed effect sizes for each assessment were averaged to create one effect size for each independent sample (Lipsey and Wilson, 2001). If an article had four closed-ended exams, for example, they were averaged and presented as one data point in the closed assessment category. On occasion, a study employed different types of assessment, herein referred to as a split article. For example, Hemraj-Benny and Beckford (2014) used a short-answer (non-closed) assessment for Exam 1 and a multiple-choice (closed) assessment for Exam 2. Split articles were not included in either of the categories across which they split; in the Hemraj-Benny example, the split article was not considered in the Assessment Type construct. This decision is revisited in an additional analysis presented in the appendix, where split articles are added to one relevant category and the analysis conducted, then removed and added to the alternative relevant category. Split articles only impacted the Assessment Type and Assessment Coverage constructs.
The meta-analysis used a random effects model, similar to Warfa's original analysis, owing to the heterogeneity present among the articles. This decision assumes that there is variability between studies that is randomly distributed. In the random effects model, articles are weighted based on sample size and effect size observed in each article (fixed effect component) and a term that considers the variability between the studies in the article (random effect component) (Lipsey and Wilson, 2001, pp. 116–120). The assignment of article weights was performed once on the entire corpus of articles (to model variance of the population) and these weights were used throughout the remaining analyses.
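As an illustration of this weighting scheme, the following sketch computes a random-effects weighted mean and its 95% confidence interval using the common DerSimonian and Laird estimate of between-study variance; it is a generic implementation consistent with Lipsey and Wilson's description, not the code used in this study, and the input values in the example are invented.

```python
import math

def random_effects_mean(effects, variances):
    """Random-effects weighted mean (DerSimonian-Laird) with a 95% CI.

    effects: one Hedges's g per independent sample
    variances: the sampling variance of each g
    """
    # Fixed-effect component: inverse-variance weights
    w = [1 / v for v in variances]
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    # Between-study heterogeneity (Q) and the random-effects variance tau^2
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau_sq = max(0.0, (q - (len(effects) - 1)) / c)
    # Random-effects component: each study's variance is inflated by tau^2
    w_star = [1 / (v + tau_sq) for v in variances]
    mean = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# Invented values for three samples, purely to show the calculation
print(random_effects_mean([0.4, 0.9, 0.1], [0.04, 0.09, 0.05]))
```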
For analysis of the data, construct overlap was first considered by conducting a chi-square test between each possible pair of constructs. Evidence of construct overlap may mean that differences observed for one category are the result of another category, in essence a covariate relationship. Second, for each category, the weighted average effect size was determined and a 95% confidence interval was created. The confidence interval allows a determination of whether CL use in a specific category has a statistically significant impact that is greater than zero. Finally, the effect sizes for the dichotomous categories within each construct were compared using the Qb statistic as described by Lipsey and Wilson (Lipsey and Wilson, 2001, p. 121). The Qb statistic is analogous to conducting an independent sample t-test between two categories to determine whether one category has an effect size that is significantly greater than the effect size in the other category. The decision regarding split articles described earlier ensures that the samples described in each category are exclusive, with no overlap between them.
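To make the moderator test concrete, the sketch below computes a Qb value as the weighted sum of squared deviations of each category's mean effect size from the grand mean, referred to a chi-square distribution with one fewer degree of freedom than the number of categories; it is a generic illustration assuming the random-effects weights from the previous sketch, not the authors' implementation, and the numbers in the example call are invented.

```python
from scipy.stats import chi2

def q_between(groups):
    """Q_b moderator test; groups is a list of (weights, effects) per category."""
    group_w = [sum(w) for w, _ in groups]
    group_means = [sum(wi * g for wi, g in zip(w, e)) / sum(w) for w, e in groups]
    grand_mean = sum(wg * m for wg, m in zip(group_w, group_means)) / sum(group_w)
    # Weighted squared deviations of category means from the grand mean
    qb = sum(wg * (m - grand_mean) ** 2 for wg, m in zip(group_w, group_means))
    p_value = chi2.sf(qb, df=len(groups) - 1)
    return qb, p_value

# Invented weights and effect sizes for two categories, purely illustrative
print(q_between([([25.0, 18.0], [0.9, 1.2]), ([22.0, 30.0], [0.1, -0.2])]))
```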
The number of studies in each category and the weighted mean effect size with a 95% confidence interval are presented in Table 2. The data presented in Table 2 indicate that CL is a robust intervention that has produced a positive, statistically significant outcome in each categorization except for cumulative assessments. The test of each category relative to an effect size of zero is significant at p < 0.05 when the lower confidence limit is positive. The negligible effect size associated with cumulative assessments is particularly noteworthy. This effect size may represent a limitation of CL in terms of content retention as measured by a cumulative exam, a finding that echoes a call for further research on the longitudinal impact of pedagogical reforms (National Research Council, 2012; Lewis, 2014). The Qb statistic for single versus cumulative assessments indicates that the higher student performance on single-topic exams versus cumulative exams is statistically significant as well. The presence of only six studies in the cumulative exam category is partially attributed to the decision to omit split articles. In the follow-up analysis (see Appendix), three split articles additionally contribute to the cumulative category with consistent results. Of the nine articles that used cumulative assessments, the effect sizes ranged from −0.497 to 0.330, with four of the nine articles having negative effect sizes and two others having effect sizes less than 0.050.
| Construct | Category (N) | Weighted mean effect size | 95% confidence limit | Qb statistic |
|---|---|---|---|---|
| Assessment type | Closed (9) | 0.783 | [0.387, 1.178] | 0.96 |
| | Non-closed (11) | 0.515 | [0.153, 0.876] | |
| Assessment coverage | Cumulative (6) | −0.088 | [−0.479, 0.392] | 16.3a |
| | Single (13) | 1.12 | [0.78, 1.45] | |
| CL usage | Periodic (9) | 0.433 | [0.037, 0.830] | 1.05 |
| | Consistent (16) | 0.678 | [0.378, 0.978] | |
| Group size | Four or less (14) | 0.443 | [0.122, 0.764] | 1.85 |
| | Five or more (8) | 0.813 | [0.388, 1.237] | |
| Every study (25) | | 0.586 | [0.339, 0.834] | |

a Qb statistic significant at p < 0.01.
CL was effective in the remaining categorizations, with average effect sizes ranging from 0.433 to 0.834. CL produced a higher average effect size when implemented consistently, with a maximum group size of five or more, and with closed-ended assessments. However, the Qb statistic did not identify any of these categorizations as significantly more effective than their counterparts within the construct. Instead, the results speak to the robustness of CL as a pedagogical tool in a variety of scenarios.
Literature guidelines have suggested group sizes of four students or fewer for CL, on the grounds that when group sizes exceed four students, individuals tend to communicate less frequently (Cooper, 1995; Johnson and Johnson, 2009). However, when studies reported a maximum group size of five or more, the impact was larger than, though statistically comparable to, that of studies with a maximum group size of four or less. The impact of group size is likely influenced by the demands of the setting, such as the instructor-to-student ratio and the physical placement of students to promote interactions. Lecture halls, where seats are fixed in rows pointing in the same direction, may be ill-suited to group sizes of five or more because students cannot easily interact. Alternatively, a round table with five seats may represent an effective group work set-up. Variables related to the setting would need to be examined further in the research literature before a definitive claim about the impact of group size can be made, but the articles analyzed here suggest that group sizes of five or larger can be effective.
The second research question sought to identify the range of effect sizes observed for CL in chemistry to set a benchmark for other educational interventions. Toward that end, the meta-analysis conducted here found a 95% confidence interval for the entire corpus of studies ranging from 0.34 to 0.83, with a midpoint of 0.59. This range can be thought of as defining small (0.34), medium (0.59) and large (0.83) effects for attempts to improve chemistry learning through pedagogical intervention. These benchmarks can be thought of as fluid and are expected to evolve as future reviews of research in chemistry education are conducted. These results are in line with Warfa's previous meta-analysis, which found an average effect size in chemistry of 0.68, and with a recent meta-analysis on college-level STEM performance that found an average effect size of 0.47 (Freeman et al., 2014; Warfa, 2016). Put in context, future work that employs a pedagogical intervention to impact chemistry achievement with an effect size less than 0.6 may be viewed as less effective than CL. A special exemption would be if the effect size were observed on cumulative assessments, where such effect sizes were not observed with CL.
One possibility for expanding the number of studies would be to add additional keywords to the search term including, for example “group work” or the use of widely disseminated reform efforts which incorporate CL such as “Process-Oriented Guided Inquiry Learning” or “Peer-Led Team Learning” (Gosser and Roth, 1998; Moog and Spencer, 2008). Future work is planned to expand the meta-analysis to incorporate these variants in CL methodology. Such work can then consider the effectiveness of these methodologies by using them as moderators within a meta-analysis.
| Variable (N) | Weighted mean effect size | Standard error | 95% confidence limit | Qb statistic |
|---|---|---|---|---|
| Closed with splits (10) | 0.772 | 0.191 | [0.397, 1.147] | 0.94 |
| Non-closed (11) | 0.515 | 0.184 | [0.153, 0.876] | |
The inclusion of the split article in the closed category resulted in a minimal change, from a weighted mean effect size of 0.783 (Table 2) to 0.772 (Table 3). The resulting confidence interval and Qb statistic lead to the same interpretation: CL had a positive, significant effect with closed assessments that was not statistically different from the effect with non-closed assessments.
Next, the split article's non-closed assessment was added to the non-closed category and compared to the original closed assessment category in Table 4.
| Variable (N) | Weighted mean effect size | Standard error | 95% confidence limit | Qb statistic |
|---|---|---|---|---|
| Closed (9) | 0.783 | 0.202 | [0.387, 1.178] | 0.92 |
| Non-closed with splits (12) | 0.526 | 0.176 | [0.180, 0.871] | |
The impact of adding the non-closed assessment was also minimal, changing the weighted mean effect size from 0.515 (Table 2) to 0.526 (Table 4). The non-closed assessment category remained significantly greater than zero and not statistically different from closed assessments.
The weighted mean effect size in the cumulative assessments had a minor change from −0.088 (Table 2) to −0.020 (Table 5). This value can still be interpreted as a negligible effect on student achievement with a confidence interval that still ranges across zero. Additionally, the effect of CL on cumulative assessments was still found to be significantly below the effect of CL on single-topic assessments.
Finally, the three split samples were added to the single-topic assessment category in Table 6.
The inclusion of the three articles in the single-topic assessment category produced the largest change observed, moving the weighted mean effect size from 1.12 (Table 2) to 0.90 (Table 6). The resulting value remains significantly greater than zero and significantly greater than that for cumulative exams.
Overall, the above analysis can be viewed as a sensitivity test of the original analysis to the decision regarding split articles. The results indicate that the most noteworthy finding, the lack of impact for CL on cumulative exams, is not impacted by the original decision to omit split articles. In addition, none of the other categories would change the interpretation of the effectiveness of CL presented in the original analysis.