Doing it for themselves: students creating a high quality peer-learning environment

Kyle W. Galloway * and Simon Burns
School of Chemistry, University of Nottingham, Nottingham, NG7 2RD, UK. E-mail: kyle.galloway@nottingham.ac.uk

Received 7th October 2014, Accepted 20th October 2014

First published on 20th October 2014


Abstract

To support our students during their study and exam preparation we have developed a novel synoptic revision exercise using the online PeerWise system. Academic staff involvement was passive after introducing the assignment to the cohort via scaffolding activities, thus generating an entirely student-led peer-learning environment for the task. Student engagement exceeded all expectations with high levels of activity and peer-learning occurring over a wide range of topics. We report on a detailed investigation of the quality of the student-generated content, involving two years of data with separate cohorts. The analysis includes classification of the student question type (revised Bloom's taxonomy), investigation of the utility of the feedback/model answers, along with time-resolved analysis of activity during the coursework window. The research seeks to reveal the nature of student behaviour in a peer-review environment and alleviate some of the common concerns held by academics considering moving to this type of activity.


Research questions

What is the quality of the student-generated content created in an online peer-learning environment? What is the nature of student contribution and engagement in a peer-learning task?

Background

PeerWise learning environment

PeerWise (http://peerwise.cs.auckland.ac.nz) is an online peer-learning environment that allows a class of students to create their own database of multiple choice questions (Denny et al., 2008). A student can write their own multiple choice question and submit this to the system along with the model answer to act as feedback for other class members. The students can answer the other questions in the database contributed by their cohort, allowing peer-learning and sharing of course content. After answering a question the student may then rate that question on both the difficulty and quality aspects, thus generating peer-review of the content. A comments system enables additional student peer-learning via discussion of ideas, feedback on the question, additional queries and responses which often lead to amplification or clarification of the model answer.

The creation of a good quality multiple choice question is a challenging task since it must include the correct model answer along with carefully written ‘incorrect but plausible’ distractor choices. In some cases these distractors can be designed to correspond to the results of common mistakes or misconceptions, which requires careful consideration of likely answering strategies. It could therefore be argued that creating as well as answering multiple choice questions leads to a greater depth of understanding (Draper, 2009). By implementing this task within a peer-learning and peer-review environment we also hoped to encourage high level attributes such as self-reflection, communication and problem-solving skills (Boud et al., 1999; Topping, 2005). A study in the UK has investigated “Skills required by new chemistry graduates and their development in degree programmes” (Hanson and Overton, 2010) and identified skills which are desired in employment but underdeveloped as part of degree programmes. These nine development deficits are mainly generic/transferable skills that are not always explicitly taught. The inclusion of a PeerWise activity in a course gives experience of seven of these nine areas: numeracy and computational skills; report writing skills (i.e. written communication); information retrieval skills; problem-solving skills; team-working skills (i.e. whole class collaboration on a common task); time management/organisation skills; independent learning ability. The task is therefore beneficial for the development of these desirable generic/transferable student skills, in addition to supporting the study and consolidation of key subject-specific skills and core knowledge.

PeerWise is used around the world in over 700 universities, schools and technical institutes. Literature reports include analysis of undergraduate course implementations in Computing Science (Denny et al., 2010), Engineering (Denny et al., 2009), Physics (Bates et al., 2012), Veterinary Science (Rhind and Pettigrew, 2012), Biochemistry (Bottomley and Denny, 2011), Chemistry (Ryan, 2013), along with our reports of a UK cross-disciplinary study in Physics/Chemistry/Biology (Casey et al., 2014; Hardy et al., 2014). Two key summary results are that students tend to contribute more questions, answers and comments than the minimum requirements for their course credit, and that higher PeerWise activity within a class correlates with higher summative assessment performance. In relation to the course implementation presented in this paper, more detailed discussions of the student engagement characteristics (Casey et al., 2014) and the relationship between PeerWise use and examination performance (Hardy et al., 2014) are already available in the literature.

There are fewer studies concerning investigation of the quality of the student-generated content in PeerWise. Published findings with expert ratings are encouraging so far, with examples where the majority of questions have been classified as good (Hakulinen, 2010; Hakulinen and Korhonen, 2010) and a study that considers topic coverage, question quality and difficulty to be of high standard (Purchase et al., 2010). Another investigation (Bottomley and Denny, 2011) found that over 90% of questions were correct, while approximately half of the incorrect questions had been identified as incorrect by other students, highlighting the utility of the peer-interactions available through the system. A recent publication (of particular relevance to the study reported here) concerns a UK first year physics course implementation (Bates et al., 2014), where 75% of the questions were classified as high quality by fulfilling the criteria that they are clear, correct, require more than simple factual recall to answer, and possess a correct solution and plausible distractors. The researchers also mapped question quality onto the levels in the cognitive domain of Bloom's taxonomy. Their classification strategy has been adopted as a basis for the new chemistry study reported here.

PeerWise course implementation

In this study, PeerWise was implemented into the course as a synoptic revision exercise in a year-long chemistry module for first year chemistry students. The activity was introduced to the students as a revision task to help them with their studies and preparation for the end-of-year module exam. The coursework task was worth 5% of the module mark as an encouragement for students to engage, as it has been found that entirely formative ‘optional extra’ tasks can have lower participation (Luxton-Reilly and Denny, 2010; Rhind and Pettigrew, 2012) and the success of the coursework task as a collective resource is dependent on student contributions. The minimum expected engagement for a student was to write 1 question, answer 5 questions, and rate and comment on 3 questions; the final coursework mark also depended on the PeerWise reputation score, which is based on a variety of participation factors and class interactions. The assessed task duration was an eight week period towards the end of the module, although the database remained fully available to students even after the coursework deadline to allow continued use in exam preparation. Academic staff involvement was passive after introducing the assignment to the cohort via scaffolding activities, thus generating an entirely student-led peer-learning environment for the task. This was a deliberate policy so that students could take full ownership of the material, without being inhibited by the constant presence of academic staff or becoming reliant on a ‘teacher safety net’ of moderation. Student engagement exceeded all expectations, with high levels of participation during term time and over the Easter holiday, and with continued (although understandably reduced) activity after the coursework deadline and into the exam period, as shown in Fig. 1.
Fig. 1 Plots for 2012 cohort student engagement over time, with number of questions (top) and answers (bottom) submitted per day. The coursework deadline, holiday period and final exams are highlighted in red on the timeline.

These very high levels of engagement have been found to be reproducible over several cohorts of students, as shown in Table 1. Similar engagement has also been observed in PeerWise implementations in other University of Nottingham chemistry modules for multi-subject degree programmes, but an analysis of those mixed cohorts is outwith the scope of this particular study.

Table 1 Comparison of the minimum expected engagement levels with the actual numbers submitted to the task. The number of students in the cohort is equal to the expected number of questions.
Cohort   Questions (expected / submitted)   Answers (expected / submitted)
2012     163 / 540                          815 / 15,329
2013     182 / 597                          910 / 16,442


Quality investigation

Student grouping and content sampling methodology

For each cohort of students the class was initially divided into four quartiles (Q1 lowest to Q4 highest) using the results of a pretest, which was a class test taken in the same module during the previous semester, before any of the PeerWise activities had taken place. The one hour in-class test consisted of multiple choice questions on topics covered in the early part of the module. Students whose identical pretest scores straddled a quartile boundary were assigned to the higher or lower quartile at random using a random number generator. Within each quartile the students were split into two subsets based on their PeerWise reputation score: those with a score higher than the class median were designated High PeerWise Activity (HPA) students, while those with a score below the median were designated Low PeerWise Activity (LPA) students. As a requirement of the grouping system, students who did not have a complete set of data comprising pretest, PeerWise engagement and final exam results were excluded from the analysis (5 in the 2012 class and 14 in the 2013 class). This investigation was carried out for two independent cohorts of students, from 2012 and 2013; however, for clarity the analysis plots using quartile splits are shown for the 2012 cohort, for which further details are available in previous reports (Casey et al., 2014; Hardy et al., 2014). The 2012 and 2013 cohort student behaviour was generally very similar and the plots displayed the same overall trends unless otherwise stated in the discussion.
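As a rough illustration of this grouping procedure, the following sketch uses pandas with hypothetical column names ('pretest' and 'reputation'); the actual processing pipeline is not described in the paper, so this is only a minimal reconstruction under those assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def assign_groups(students: pd.DataFrame) -> pd.DataFrame:
    """Split a cohort into pretest quartiles and HPA/LPA activity subsets."""
    students = students.copy()
    # Tiny random jitter breaks pretest ties, so students straddling a quartile
    # boundary are assigned to the higher or lower quartile at random.
    jittered = students['pretest'] + rng.random(len(students)) * 1e-9
    students['quartile'] = pd.qcut(jittered, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    # High/Low PeerWise Activity split at the class median reputation score
    # (students exactly at the median are grouped with LPA here; the paper
    # does not specify this edge case).
    median_rep = students['reputation'].median()
    students['subset'] = np.where(students['reputation'] > median_rep, 'HPA', 'LPA')
    return students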

Analysis of the module summative assessment shows a positive and significant correlation between PeerWise activity and student performance in end-of-year examinations (PeerWise reputation score versus mean exam total, Pearson R = 0.377, p ≪ 0.001). Fig. 2 shows a representation of this correlation with the student grouping method applied, highlighting the performance of the LPA/HPA students in the quartiles. This would suggest that the synoptic revision coursework activity is successfully fulfilling its intended purpose, with improvements that are not simply due to the inherent prior ability of the students. Further investigation of the contributed content quality and engagement behaviour of these quartiles and their subsets is therefore of interest. We would characterise the research as quasi-experimental, with an element of case-study.
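The reported statistic can be reproduced from matched student records with a standard Pearson test; the sketch below assumes the grouped DataFrame from the previous sketch with an added hypothetical 'exam_total' column.

from scipy.stats import pearsonr

def activity_exam_correlation(students):
    # 'reputation' and 'exam_total' are assumed column names (see the grouping sketch).
    r, p = pearsonr(students['reputation'], students['exam_total'])
    return r, p  # the paper reports R = 0.377, p << 0.001 for the 2012 cohort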


Fig. 2 Exam performance correlation with PeerWise activity for students in the 2012 cohort. Numbers of students in the subsets (LPA:HPA) are Q1 (24:16), Q2 (24:17), Q3 (19:22), Q4 (13:28). Error bars are the standard error of the mean.

The full student class list obtained from the quartile grouping method was used for selection of samples from the database of student-generated content. The objective was to be able to compare contributions from the different quartile groups and the LPA/HPA subsets. For expert review, initially one HPA and one LPA student from the centre of each of the four quartiles were selected. All of the contributions of these students were evaluated for quality using the methodology described in the next section. The process was then repeated such that every cycle covered students from all quartiles and subsets, building up coverage of the overall student class list. In cases where a student had submitted more than one question, all contributions were classified separately and then an average was used to represent the student in the analysis list. The analysis was applied to both the 2012 and 2013 cohorts in order to investigate whether observed results were reproducible. The coverage of the samples for expert review is shown in Table 2.

Table 2 Expert review content sample sizes with percentage of total set (a)
Cohort              2012        2013
Students sampled    64 (39%)    72 (40%)
Questions sampled   210 (39%)   210 (35%)
(a) Note that many factors considered in this research were available from an analysis of the data output from the PeerWise system; therefore aspects such as numbers of questions/answers submitted could be investigated using data for the full cohort of students. The system data can also provide the peer-review ratings for question difficulty and quality. For factors with system data available for the full cohort, the quartile grouping method was used for analysis.
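A minimal sketch of this cyclic sampling is given below, assuming the 'quartile', 'subset' and 'pretest' columns from the grouping sketch above. The exact rule for picking the ‘centre’ of a quartile is not specified in the paper, so selecting the not-yet-sampled student closest to the group median pretest score is an assumption.

import pandas as pd

def sample_cycle(students: pd.DataFrame, already_sampled: set) -> list:
    """One sampling cycle: pick one unsampled HPA and one unsampled LPA student
    per quartile, closest to that group's median pretest score."""
    chosen = []
    for q in ['Q1', 'Q2', 'Q3', 'Q4']:
        for subset in ['HPA', 'LPA']:
            group = students[(students['quartile'] == q) & (students['subset'] == subset)]
            pool = group[~group.index.isin(already_sampled)]
            if pool.empty:
                continue
            centre = group['pretest'].median()
            chosen.append((pool['pretest'] - centre).abs().idxmin())
    return chosen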


Classification methodology

The question/solution categorization approach developed by Bates, Galloway, Riise and Homer (Bates et al., 2014) was used as a basis for the study reported here. The cognitive level of the student questions was classified based on Anderson and Krathwohl's revised version (Anderson and Krathwohl, 2001) of Bloom's taxonomy (Bloom, 1956), as shown in Table 3.
Table 3 Categorization levels and explanations for the cognitive domain of Bloom's taxonomy
Level Identifier Description
1 Remember Factual recall, definition.
2 Understand Basic understanding, no calculation necessary.
3 Apply Implement, calculate or determine. Single topic calculation or exercise involving application of knowledge.
4 Analyse Multi-step problem, requires identification of problem-solving strategy before executing.
5 Evaluate Compare and assess various option possibilities, often qualitative and conceptual questions.
6 Create Synthesis of ideas and topics from multiple course topics to create significantly challenging problem.


It should be noted that during the classification procedure the question and the solution provided by the author were both visible, along with the comments posted about the question by other students and/or the author. This provided a greater contextual overview for the rating of the question, with the intended answering strategy visible, rather than the question being treated in isolation. The quality of the solution to the question was rated based on the utility of the model answer as informative feedback for other students. The categorization levels are described in Table 4.

Table 4 Categorization levels for explanation of solution to questions
Level Identifier Description
0 Missing No explanation provided or explanation incoherent.
1 Inadequate Wrong reasoning and/or answer. Solution may be trivial, flippant or unhelpful.
2 Minimal Correct answer but with insufficient explanation or justification. Some aspects may be unclear or incorrect.
3 Good Clear and sufficiently detailed exposition of both correct method and answer.
4 Excellent Thorough description of relevant chemistry and solution strategy. Contains remarks on plausibility of answer and/or other distractors. Beyond normal expectations for a correct solution.


In addition, further quality indicators were developed as shown in Table 5, which include consideration of the overall coherence of the question and its set of answer choices. The number of answer options available is another factor that could affect the ability to simply guess an answer (which also depends on the plausibility of the distractors). The validity of the answer was considered based on whether the suggested answer is actually correct, and also on whether the most popular answer selected by the class matches the author's answer, since a discrepancy may indicate a poorly formed question or an unintended outcome. The evaluation of the comments relates to their possible use as a mechanism of peer-learning through the sharing of scientific content, which may extend further to peer-review and improvement of the question.

Table 5 Additional categorizations with level descriptors
Level Question coherence
0 Not coherent.
1 Coherent.

Level Answer options
0 1 or 2 answer choices.
1 3 answer choices.
2 4 or 5 answer choices but implausible distractor(s).
3 4 or 5 answer choices with plausible distractors.

Level Answer validity
0 The author's suggested answer is incorrect.
1 Author's suggested answer is not the most popular but it is the correct answer.
2 Author's suggested answer is the most popular and it is the correct answer.

Level Comments
0 All comments left on the question are phrases such as ‘Good question’ or ‘Good explanation’.
1 Comments left contain phrases of a scientific nature but no ‘discussion’.
2 Comments left suggest improvements and/or new ideas leading to discussion and/or improvements to the question as a whole.


Plagiarism was considered as a binary measure in response to ‘Is the question obviously plagiarised?’, with the caveat that it is clearly not practical to crosscheck contributions against all other student submissions, textbooks and internet resources. However, the researcher conducting this classification (S.B.) had taken the same undergraduate modules and was therefore familiar with the course materials encountered by these cohorts of students.

To begin the classification process both researchers (S.B. and K.W.G.) discussed the categorization levels and intended outcomes. Ten questions were then randomly selected from the student database and classified independently, before the rating results were compared between researchers. Similarities and differences were discussed and clarifications on the levels for future use were agreed. This iterative cycle was continued, using 5 randomly selected questions for classification each time, until consistent results were obtained. One researcher (S.B.) then carried out the classification of student content for both 2012 and 2013 cohorts. During and after the classification process, questions were selected at random for review of the ratings by a second researcher (K.W.G.). For the random selection processes the questions were only identified by the PeerWise system question ID number so that no other factors (such as topic area, question wording) would influence the choices, which were made from a mixed list of ID numbers.

Results and discussion

Quality of questions

The classification of the cognitive level of the questions using the revised Bloom's taxonomy (Anderson and Krathwohl, 2001) is illustrated in Fig. 3, which shows very similar distributions for the 2012 and 2013 cohorts. The vast majority of the contributions are beyond the simple factual recall ‘Remember’ type of question, although there is a significant proportion of ‘Understand’ type questions that require a basic understanding of concepts but do not include calculations. The most common questions were those of ‘Apply’ type, which include either a numerical calculation or application of knowledge to a problem. By contrast, there were fewer of the more complicated ‘Analyse’ type multi-step problem questions, although there was a large proportion of ‘Evaluate’ type where several possible outcomes to a scenario must be compared (often qualitatively). Only a very small proportion of questions were of the advanced ‘Create’ type, which brings together many course topics to form a challenging problem. It would therefore appear that even as a synoptic revision exercise the multiple choice questions tend to remain focussed on particular course topics. This may actually have a beneficial aspect, as in online feedback questionnaires many students reported adopting an exam preparation approach of revising a course topic and then selecting questions from the database to check their study of that area. This is facilitated in the PeerWise system by the tagging feature, which allows students to associate questions with course topics or chemical concept areas and so rapidly filter desired questions from the database.
Fig. 3 Distribution of question type classification using the cognitive levels of the revised Bloom's taxonomy. Data is shown for 2012 and 2013 cohorts.

The distribution of question types, including those at higher cognitive levels, is pleasing and suggests that the scaffolding activity was successful. This introductory activity for the task considered examples of good and bad multiple choice questions that had been created based on a typical lecture slide. The strengths and weaknesses of the examples were discussed in student groups, including a class feedback summary facilitated by an academic member of staff (K.W.G.), with the aim of illustrating the consideration required to create a good quality multiple choice question. The students were specifically challenged to aim for more than simple factual recall questions, and it is clear from these results that they responded very well to this objective. It is even more impressive considering previous results (Momsen et al., 2010) showing that 93% of questions written by a sample of fifty instructors of U.S. introductory biology courses were at the lowest two levels of the revised Bloom's taxonomy.

It is interesting to note that the distributions for this first year chemistry course are very similar to those obtained from a first year physics course that is also taught in the UK (Bates et al., 2014), with one notable difference in the relative proportions of level 4 ‘Analyse’ and level 5 ‘Evaluate’ PeerWise questions. In this chemistry study there are significantly more ‘Evaluate’ than ‘Analyse’ type questions, whereas the situation is reversed for the physics course. This is perhaps reflective of subtle differences in the nature of the two physical sciences, with chemistry including more qualitative or conceptual evaluation questions, whereas physics has more defined multi-step problems that may require a sequence of mathematical treatments. By comparison, a study of PeerWise questions by second year biomedical students in Australia (Bottomley and Denny, 2011) found that 91% of questions contributed by students were classified as ‘Remember’/‘Understand’, with only 9% at the higher ‘Apply’/‘Analyse’ levels; however, the authors note that this was to be expected for their course. Indeed, one must be cautious in such comparisons due to differences in degree course structures between subjects, institutions and countries. There is also a possibility of some variation in the particular interpretation of the taxonomy levels used for classification by different researchers.

The distribution of quality classifications for the explanation of solutions to the questions is shown in Fig. 4. Once again the behaviour of the two cohorts is similar; however, it should be noted that all of the explanations graded as ‘Missing’ came from questions by the same 2012 student, and this level 0 explanation was not seen for any other student in the sample. There are comparatively few ‘Inadequate’ answers that are wrong or unhelpful, with the vast majority of solutions at ‘Minimal’ level or above and therefore including the correct answer. Most of the model answers were at ‘Good’ level, which included a full explanation of the required method, while an impressive proportion achieved the higher ‘Excellent’ level by including thorough discussion of the answering strategy and consideration of the plausibility of the distractors. The distribution is very similar to that obtained for the comparable physics study (Bates et al., 2014) discussed earlier.


Fig. 4 Distribution of explanation classification for the solution provided to the question. Data is shown for 2012 and 2013 cohorts.

The utility of model answers was also one of the topics of discussion of the scaffolding activity, since this is an important resource in this synoptic revision exercise: for students who have chosen incorrectly the feedback should aim to resolve errors or misconceptions; while for students who have chosen correctly the feedback should aim to enhance comprehension, perhaps by displaying further detail or an alternative answering strategy. It should also be noted that the rapid timescale of the feedback provided on the system is very useful in this context.

It could be argued that a PeerWise task (including an associated scaffolding activity) that is fully integrated into a course can address the ‘seven principles of good feedback practice’ (Nicol and Macfarlane-Dick, 2006) by using a system that facilitates formation of a self-regulated learning environment for students (Denny et al., 2008; Bottomley and Denny, 2011; Hardy et al., 2014).

Table 6 gives summary data showing the percentage of questions that meet the specified threshold criteria for the various categorizations. The two cohorts once again display similar characteristics, so data is also provided as a combined cohort that includes all 420 sampled questions in the study. Overall, 86% of the questions in this study were classified as being a ‘High quality question’ by fulfilling all of the threshold criteria stated in Table 6. These questions are therefore coherent, correct, require more than simple factual recall, and possess a valid solution along with reasonable distractors. The results here compare favourably with the related physics study, where 75% of the sampled 602 questions were classified as high quality using very similar threshold criteria (Bates et al., 2014). Of particular note is that from the two different subject samples only 4% of the chemistry questions and 5% of the physics questions were rejected on the basis of having an incorrect answer. Anecdotally, the presence of incorrect answers is a common concern expressed by instructors about peer-learning tasks; these findings (from tasks that were not actively moderated by academic staff) are therefore reassuring. It should also be noted that incorrect answers are often highlighted by other students via comments (Bottomley and Denny, 2011).

Table 6 Percentage of questions in expert review samples that meet quality threshold criteria. Also shown is the percentage of cases where an individual question fulfils all six quality thresholds to qualify as a ‘High Quality Question’. Data is shown for 2012, 2013 cohorts and as combined set
Quality level threshold 2012 (%) 2013 (%) Combined (%)
Cognitive level ≥ 2 95 94 95
Explanation level ≥ 2 93 95 94
Coherence level = 1 99 100 100
Answer options level ≥ 1 89 93 91
Answer validity level ≥ 1 95 98 96
Not obviously plagiarised 100 100 100
High quality question 84 88 86
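For concreteness, the ‘High quality question’ decision in Table 6 can be written as a simple predicate over the six categorization levels. The sketch below is only an illustration: the field names are ours, not taken from any PeerWise export.

def is_high_quality(q: dict) -> bool:
    """Apply the Table 6 threshold criteria to one classified question."""
    return (q['cognitive_level'] >= 2        # beyond simple factual recall
            and q['explanation_level'] >= 2  # at least a correct answer ('Minimal' or better)
            and q['coherence'] == 1          # question and options coherent
            and q['answer_options'] >= 1     # at least three answer choices
            and q['answer_validity'] >= 1    # author's suggested answer is correct
            and not q['plagiarised'])        # not obviously plagiarised

# Example: a coherent 'Apply' question with a good explanation,
# plausible distractors and a correct, most-popular answer.
example = dict(cognitive_level=3, explanation_level=3, coherence=1,
               answer_options=3, answer_validity=2, plagiarised=False)
assert is_high_quality(example)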


It was of interest to compare the question quality ratings obtained from student peer-review with those obtained by expert-review, as shown in the scatter plot of Fig. 5. The peer-review score is obtained from student quality ratings in the PeerWise system, which operates on a scale of 0 (very poor) to a maximum of 5 (excellent). For comparison, the expert-review score takes each of the six categorizations in Tables 3–5, scales each categorization level to a fraction of its maximum (i.e. a value between 0 and 1) and then sums the fractions to give an aggregate score for the question with a maximum value of 6. There will inevitably be differences in the ratings obtained, since students rate on a single quantised 0–5 scale whereas the expert score is composed of several detailed criteria. A student rating may not necessarily include consideration of the impact of student comments and author replies (although these are visible at the time of rating); however, these were included in the expert aggregate score as a means of capturing the peer-learning aspect associated with this additional content related to the question. An analysis of the data reveals a positive and significant correlation between the peer-review and expert-review quality ratings (Pearson R = 0.283, p ≪ 0.001). This suggests that the students are being reasonably fair and considered in their rating choices, making peer-review a useful aspect of the PeerWise system.
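A minimal sketch of this aggregation is given below, assuming the same illustrative field names as above and that scaling ‘to a fraction’ means dividing each level by its maximum value from Tables 3–5 (an interpretation on our part).

# Maximum level of each of the six categorizations (Tables 3-5).
MAX_LEVEL = {'cognitive_level': 6, 'explanation_level': 4, 'coherence': 1,
             'answer_options': 3, 'answer_validity': 2, 'comments': 2}

def expert_aggregate(q: dict) -> float:
    """Scale each categorization level to a fraction of its maximum and sum,
    giving an aggregate expert-review score out of 6."""
    return sum(q[k] / MAX_LEVEL[k] for k in MAX_LEVEL)

# e.g. a mid-range 'Apply' question with a good explanation and useful comments:
print(expert_aggregate(dict(cognitive_level=3, explanation_level=3, coherence=1,
                            answer_options=3, answer_validity=2, comments=1)))  # 4.75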


Fig. 5 Correlation of peer-review and expert-review question quality ratings. Data is shown for 2012 and 2013 cohorts combined.

In addition to rating the quality of a question, the students may also rate its difficulty on a quantised scale of 0 (easy), 1 (medium) or 2 (hard). The scatter plot of Fig. 6 compares the peer-review difficulty and quality ratings for the same sample of questions as used for Fig. 5. It is interesting to note that there is a positive and significant correlation between the difficulty and quality ratings given by students (Pearson R = 0.527, p ≪ 0.001). While a higher quality question at a greater taxonomy level might be expected to be more difficult, owing to the more complicated answering strategy that needs to be employed, it is perhaps still surprising that the correlation is as strong as seen here. The sample also seems to suggest a tendency towards easy/medium difficulty questions as opposed to hard ones; however, difficulty was not an aspect specifically considered in the expert rating.


Fig. 6 Correlation of peer-review ratings for the difficulty and quality of the questions. Data is shown for 2012 and 2013 cohorts combined.

It is also useful to investigate the course topics covered by the questions written by the students. The distribution of the submitted questions with respect to the three classical branches of chemistry (2012/2013) was found to be: inorganic (22%/20%), organic (36%/34%) and physical (42%/46%). Inorganic topics are the least common; however, this is likely due to the structure of this particular course, as the majority of the inorganic chemistry was delivered in a different module. There is also clearly a caveat in terms of the classification of question topics, particularly in areas at the interface between the classical branches. Physical chemistry questions are more common than organic chemistry questions, which may be because the mathematical treatment of many physical problems lends itself readily to a multiple choice question format.

Student behaviour in quartiles and subsets

Having investigated the quality of the student-generated content, we now address the second research question: what is the nature of student contribution and engagement in a peer-learning task? To consider this aspect, the student grouping system was used to evaluate the behaviour of the different class quartiles and their HPA/LPA subsets. Fig. 7 shows the average quality ratings for the questions submitted by the different groups within the class, using the expert-review aggregate score described earlier. The average quality is very similar across the entire cohort, with little difference between quartiles or between HPA and LPA students. This is encouraging and suggests that good quality contributions can be obtained from the whole class, rather than being limited to those of higher prior ability. However, investigation of the number of questions submitted per student reveals that HPA students in all quartiles post significantly more questions to the database than their LPA counterparts, as shown in Fig. 8. On average, HPA students submit at least double the number of questions of LPA students. For HPA students, there also appears to be a small trend of greater numbers of submissions from the higher quartiles (although the 2013 cohort Q4 HPA result is anomalously low in this regard, yet still over double the LPA number). Nevertheless, it is apparent that the HPA students from all quartiles are responsible for generating the majority of the questions posted to the task.
Fig. 7 Average expert-review question quality rating for the 2012 cohort subsets. The rating is an aggregated score from the levels of the six different categorizations investigated. Error bars are the standard error of the mean.

Fig. 8 Average number of questions posted by students in the 2012 cohort subsets (full cohort data sample). Error bars are the standard error of the mean.

In terms of the answers submitted to the task by students, Fig. 9 shows the average percentage of ‘correct’ answers per student in each subset. Note that ‘correct’ in this case refers to the student choice matching the answer specified by the author of the question, although there may be some instances of author errors. There appears to be a slight trend of increasing percentage ‘correct’ towards the higher quartiles, though more interestingly there is no significant difference in performance between HPA and LPA students within a quartile. Further analysis of the average number of answers submitted by these students reveals that HPA students answer significantly more questions than LPA students, as shown in Fig. 10. So even though the overall answering success rate is similar for HPA and LPA subsets, the HPA students are clearly attempting far more questions, often in excess of four times as many as LPA students (with a more pronounced difference at approximately seven times as many observed for the 2013 cohort). We postulate that the HPA students benefit by gaining more experience and a wider coverage of question types and topics through this greater engagement.


Fig. 9 Average percentage of ‘correct’ answers submitted by students in the 2012 cohort subsets (full cohort data sample). Error bars are the standard error of the mean.

Fig. 10 Average number of questions answered by students in the 2012 cohort subsets (full cohort data sample). Error bars are the standard error of the mean.

One mechanism of sharing experience from answering the questions is for students to use the PeerWise comments system. As part of the expert-review of the questions, the comments left by students were categorized as: level 0 comments of simple opinion; level 1 comments including scientific content; level 2 comments leading to scientific discussion and improvements. The average rating for the type of comments left by students of the subsets is shown in Fig. 11. Here we see that across the quartiles the HPA student comments are of higher quality and have moved into the range of scientific content with discussion/improvements, whereas the LPA student comments have lower quality content overall. For the 2012 cohort, a high proportion of questions (68%) had comments made on them that were of a chemistry nature or improved the question; this was lower for the 2013 cohort but still over a third of the questions (37%) included such content. The helpful comments (i.e. not just ‘good question’ or ‘nice one’) offered advice on different ways to solve problems, corrected mistakes, clarified unclear points, added to the given explanation, posted websites or references to books and lecture material relevant to the question. These comments therefore acted as a useful information resource as part of the student-generated content.


Fig. 11 Average expert rating for the contents of the comments left by students in the 2012 cohort subsets. Note that 2013 cohort Q2 HPA and LPA show no difference in rating. Error bars are the standard error of the mean.

Consideration of the time spent on the task revealed that the HPA students have significantly more active days than LPA students (typically by a factor of three), as shown in Fig. 12. The PeerWise system defines an active day as being when at least one question is authored or one question is answered. It would therefore seem that time on task is one of the key differentiating factors between the student subsets: HPA students can write more questions, answer more questions and contribute more useful comments by investing more time in the use of the system to achieve greater engagement.


Fig. 12 Average number of days active on the PeerWise system by students in the 2012 cohort subsets (full cohort data sample). Error bars are the standard error of the mean.

Time-resolved engagement study

Given the relation of contributions to time spent on task, it is also worthwhile to consider a time-resolved analysis of engagement over the timeline, which includes four key periods: pre-holiday teaching term (2 weeks); intermediate spring holiday (4 weeks); post-holiday teaching term (2 weeks); the time after the coursework deadline, including revision and exams. Note that only the engagement that occurred during the indicated 8 weeks before the deadline contributed to the coursework mark for the module.
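As an illustration of how such a time-resolved analysis can be assembled from the PeerWise activity data, the sketch below bins submission timestamps into the four periods. The boundary dates are placeholders, since the paper gives only the relative durations of the periods.

import pandas as pd

# Hypothetical period boundaries (the real dates are not given in the paper).
boundaries = pd.to_datetime(['2013-03-04',   # task opens
                             '2013-03-18',   # holiday starts
                             '2013-04-15',   # term resumes
                             '2013-04-29',   # coursework deadline
                             '2013-06-30'])  # end of exam period
labels = ['pre-holiday term', 'holiday', 'post-holiday term', 'post-deadline']

def count_by_period(timestamps: pd.Series) -> pd.Series:
    """Count submissions (questions or answers) falling in each timeline period."""
    period = pd.cut(pd.to_datetime(timestamps), bins=boundaries, labels=labels)
    return period.value_counts().reindex(labels)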

Fig. 13 shows the number of questions posted by the quartiles of students during the timeline. It is clear that in the early stages of the task, within term time, the Q4 students were contributing substantially more questions. Note that the database was initially completely empty; the higher prior ability students therefore led the way with the first examples of questions submitted to the task. Moving into the holiday period (when the students are relieved of many other time pressures), the top two quartiles Q3 and Q4 were still posting the most questions; however, Q1 and Q2 now showed much more activity than in the first part of term time, and a large number of questions were submitted during this vacation. This highlights the utility of PeerWise in enabling peer-learning even when the students are not necessarily on campus. With the return to the teaching term the behaviour of all quartiles became more similar, although the rate of submission actually remained high (noting that this 2 week period is being compared with the 4 week holiday). After the coursework deadline, rather unsurprisingly, no further questions were contributed.


Fig. 13 Number of questions posted on the PeerWise system during the timeline periods by students in the quartiles of the 2012 cohort (full cohort data sample).

The quality of the questions obtained during the timeline periods was investigated by expert-review, as shown in Fig. 14. The aggregate expert-review quality score remains pleasingly high over the timeline, with similar performances from all quartiles (the peer-review quality data are analogous). An analysis of the taxonomy level of the submissions reveals a good mixture of question types during the three periods in which questions were posted, with proportions that are broadly reflective of the overall distribution seen for the cohort. The sustained high quality over the 8 week task is impressive: we postulate that this may be related to the students starting with a high baseline quality expectation via the scaffolding activities, while the large number of questions posted at the start of the task by the Q4 students of high prior ability again serves as a high quality benchmark for the initial content encountered by the class.


Fig. 14 Average expert-review quality rating for the questions posted by the quartiles of the 2012 cohort during the timeline periods. The rating is an aggregated score from the levels of the six different categorizations investigated. Error bars are the standard error of the mean.

The number of questions answered by the class during the timeline periods is shown in Fig. 15. Once again the top two quartiles, particularly Q4, lead the way in the number of answers submitted in the early part of the task during term time. This means that these particular students are able to rate and leave comments on more questions, since peer-review occurs after question completion. Q1 and Q2 answering submissions pick up during the holiday period, though Q3 and Q4 still have substantially more activity during this time. In general it would seem that Q1 and Q2 contributions to posting and answering questions mostly occur in the middle-later parts of the timeline.


Fig. 15 Number of questions answered on the PeerWise system during the timeline periods by students in the quartiles of the 2012 cohort (full cohort data sample).

After the coursework deadline the rate of answer submission is reduced; however, it is worth noting the strong correspondence between answering activity and the scheduling of the exam papers for this module, as shown in Fig. 1.

One possible concern is tactical behaviour in the task, whereby students choose lower quality questions to answer as the coursework deadline approaches in an effort to boost their score and module mark. The low 5% weighting of the coursework was chosen to discourage such attitudes, but we can also investigate this using data from the task. Analysis of the expert-review quality of the questions selected by the students during the four periods is shown in Fig. 16. There is a small reduction in the average quality of questions selected by all quartiles after the holiday period, in the final phase before and after the deadline. It should be noted that Fig. 16 displays an expanded range of the aggregate quality scale (the maximum of the scale is 6), so this is a relatively small difference. However, if such tactics are intended to considerably skew a student's ratio of correct answers and so raise their score, they do not appear to be particularly effective, since the percentage of ‘correct’ answers (those matching the author's selection) remains broadly within the same ranges during the three assessed periods, albeit with some fluctuations both up and down over time.


Fig. 16 Average expert-review quality rating for the questions answered by the quartiles of the 2012 cohort during the timeline periods. The rating is an aggregated score from the levels of the six different categorizations investigated. Error bars are the standard error of the mean.

Related to this point, free-text feedback responses about the PeerWise coursework have often revealed a student perception of unfairness in the marking system due to peers who were “much more strategic (and so flooded the system with taxonomically lower-level questions immediately prior to the submission deadline)” (quote from Casey et al., 2014). In this study we sought to investigate the actual behaviour of the class in these circumstances. To this end, we conducted additional analysis of the question contributions for each of the seven days preceding the coursework deadline. Fig. 17 shows that the average expert-review question quality remained consistently high over the final week, with only a slight reduction on the very last day (although such a dip was not seen in 2013). These values are comparable to the typical ratings and overall quality level obtained for all question submissions (Fig. 14 and 7). It is also interesting to note that even the student peer-review quality ratings show the same pattern, with sustained quality of submissions.


Fig. 17 Average expert-review quality rating for the questions submitted by the quartiles of the 2012 cohort during the final week before the coursework deadline. The rating is an aggregated score from the levels of the six different categorizations investigated. Error bars are the standard error of the mean.

In Fig. 18, the distributions of taxonomy level for the questions also reveal a good mixture of question types over the seven days, thus disproving the notion that lower level questions (particularly level 1 ‘Remember’ factual recall) would dominate the contributions in the closing stages. Therefore we find that quality ratings from both expert-review and peer-review show a high level of question quality throughout the task, which is even maintained during the final days towards the deadline. Since these results have been found to be reproducible for both the independent 2012 and 2013 cohorts of students, we can be confident that the student perception of unfair peer-behaviour is unsubstantiated, given the evidence available from this PeerWise coursework task implementation.


Fig. 18 Distribution of question type classification, using the cognitive levels of the revised Bloom's taxonomy, for questions submitted during the final week before the coursework deadline. Data is for 2012 cohort.

Conclusions

In conclusion, this detailed study of the contributions made by students to an online peer-learning task has revealed that the student-generated content is generally of a very high standard. Overall, 86% of the examples classified in this multiple cohort investigation were found to be ‘High quality questions’: coherent, correct, requiring more than simple factual recall, and possessing a valid solution along with reasonable distractors. Student-authored explanations and peer-comments were found to be useful, while peer-review ratings of question quality correlated with our expert-review ratings. High PeerWise Activity students showed greater examination performance in comparison to Low PeerWise Activity students, yet the quality of submissions and the answering success rate are similar for both groups. It would therefore appear that time on task is the differentiating factor: High PeerWise Activity students spend a greater number of days active on the system, allowing them to author more questions and answer more questions, thus providing them with a richer experience than Low PeerWise Activity students. Time-resolved analysis of contributions has shown a sustained high level of question quality for submissions, both during term time and over a holiday period. Overall, PeerWise has been shown to be an effective online peer-learning environment for conducting a student-led synoptic revision exercise.

Acknowledgements

We wish to thank Paul Denny (creator and developer of the PeerWise system) for providing the extensive raw data required for this analysis and for his helpful comments.

Notes and references

  1. Anderson L. W. and Krathwohl D. R., (2001), A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives, New York: Longman.
  2. Bates S. P., Galloway R. K. and McBride K. L., (2012), Student-generated content: using PeerWise to enhance engagement and outcomes in introductory physics courses, AIP Conf. Proc., 1413, 123.
  3. Bates S. P., Galloway R. K., Riise J. and Homer D., (2014), Assessing the quality of a student-generated question repository, Phys. Rev. ST Phys. Educ. Res., 10, 020105.
  4. Bloom B. S., (1956), Taxonomy of educational objectives: the classification of educational goals, Handbook I: Cognitive Domain, New York: Longman.
  5. Bottomley S. and Denny P., (2011), A participatory learning approach to biochemistry using student authored and evaluated multiple-choice questions, Biochem. Mol. Biol. Educ., 39, 352.
  6. Boud D., Cohen R. and Sampson J., (1999), Peer Learning and Assessment, Assessment & Evaluation in Higher Education, 24(4), 413.
  7. Casey M. M., Bates S. P., Galloway K. W., Galloway R. K., Hardy J. A., Kay A. E., Kirsop P. and McQueen H. A., (2014), Scaffolding Student Engagement via Online Peer Learning, Eur. J. Phys., 35, 045002.
  8. Denny P., Hanks B. and Simon B., (2010), PeerWise: Replication Study of a Student-Collaborative Self-Testing Web Service in a U.S. Setting, Proceedings of 41st ACM Technical Symposium on Computer Science Education, New York: ACM, p. 421.
  9. Denny P., Luxton-Reilly A. and Hamer J., (2008), The PeerWise system of student contributed assessment questions, Proceedings of the tenth conference on Australasian computing education ACE '08, Darlinghurst, Australia: Australian Computer Society, Inc., vol. 78, p. 69.
  10. Denny P., Luxton-Reilly A. and Hamer J., (2009), Students Sharing and Evaluating MCQs in a Large First Year Engineering Course, Proceedings of 20th Australasian Association for Engineering Education Conference, p. 575.
  11. Draper S. W., (2009), Catalytic assessment: understanding how MCQs and EVS can foster deep learning, Brit. J. Educ. Technol., 40(2), 285.
  12. Hakulinen L., (2010), Using Computer Supported Cooperative Work Systems in Computer Science Education - Case: PeerWise at TKK, Master's thesis, Faculty of Information and Natural Sciences, School of Science and Technology, Aalto University.
  13. Hakulinen L. and Korhonen A., (2010), Making the most of using PeerWise in education, Proceedings of ReflekTori 2010 Symposium of Engineering Education, Aalto University, Lifelong Learning Institute Dipoli, p. 57.
  14. Hanson S. and Overton T., (2010), Skills required by new chemistry graduates and their development in degree programmes, Higher Education Academy, Physical Sciences Centre, University of Hull.
  15. Hardy J. A., Bates S. P., Casey M. M., Galloway K. W., Galloway R. K., Kay A. E., Kirsop P. and McQueen H. A., (2014), Student-generated content: enhancing learning through sharing multiple-choice questions, Int. J. Sci. Educ., 36(13), 2180.
  16. Luxton-Reilly A. and Denny P., (2010), Constructive evaluation: a pedagogy of student-contributed assessment, Comput. Sci. Educ., 20(2), 145.
  17. Momsen J. L., Long T. M., Wyse S. A. and Ebert-May D., (2010), Just the facts? Introductory undergraduate biology courses focus on low-level cognitive skills, CBE Life Sci. Educ., 9(4), 435.
  18. Nicol D. J. and Macfarlane-Dick D., (2006), Formative assessment and self-regulated learning: a model and seven principles of good feedback practice, Stud. High. Educ., 31(2), 199.
  19. Purchase H., Hamer J., Denny P. and Luxton-Reilly A., (2010), The quality of a PeerWise MCQ repository, Proceedings of the Twelfth Australasian Conference on Computing Education ACE '10, Darlinghurst, Australia: Australian Computer Society, Inc., vol. 103, p. 137.
  20. Rhind S. M. and Pettigrew G. W., (2012), Peer generation of multiple-choice questions: Student engagement and experiences, J. Vet. Med. Educ., 39(4), 375.
  21. Ryan B. J., (2013), Line up, line up: using technology to align and enhance peer learning and assessment in a student centred foundation organic chemistry module, Chem. Educ. Res. Pract., 14, 229.
  22. Topping K. J., (2005), Trends in Peer Learning, Educ. Psychol., 25(6), 631.

Footnote

The research project was evaluated and conducted under the BERA Ethical Guidelines for Educational Research, the University of Nottingham Code of Research Conduct and Research Ethics and the e-Ethics@Nottingham: Ethical Issues in Digitally Based Research guidelines. All student data sources were anonymised for use by the project team. The research activities were separate from the assessed coursework task that created the student content. Opinion comments from students were obtained via an anonymous online feedback questionnaire that included free text responses. Note that support material, examples of scaffolding materials and technical guidance screencasts are available online at: http://www.peerwise-community.org/

This journal is © The Royal Society of Chemistry 2015