Abby E. Beatty*, Abby Esco, Ashley B. C. Curtiss and Cissy J. Ballen
Auburn University, Auburn, AL, USA. E-mail: aeb0084@auburn.edu
First published on 8th February 2022
To test the hypothesis that students who complete remote online tests experience an ‘online grade penalty’, we compared performance outcomes of second-year students who elected to complete exams online to those who completed face-to-face, paper-based tests in an organic chemistry course. We pursued the following research questions: (RQ1) Are there performance gaps between students who elect to take online tests and those who take face-to-face tests? (RQ2) Do these two groups differ with respect to other affective or incoming performance attributes? How do these attributes relate to performance overall? (RQ3) How does performance differ between students who reported equal in-class engagement but selected different testing modes? (RQ4) Why do students prefer one testing mode over the other? We found that students who elected to take online tests consistently underperformed relative to those who took face-to-face tests. While we observed no difference between the two student groups with respect to their intrinsic goal orientation and incoming academic preparation, students who preferred face-to-face tests perceived chemistry as more valuable than students who preferred to complete exams online. We observed a positive correlation between performance outcomes and all affective factors. Among students who reported similar levels of in-class engagement, online testers underperformed relative to face-to-face testers. Open-ended responses revealed online testers were avoiding exposure to illness/COVID-19 and preferred the convenience of staying at home; the most common responses from face-to-face testers included the ability to perform and focus better in the classroom, and increased comfort or decreased stress they perceived while taking exams.
One factor contributing to underperformance in chemistry may be choice of testing mode. Specifically, students who elect to take their assessments remotely online rather than face-to-face and on paper may experience a testing ‘penalty’. For this study, testing mode refers to the method of delivering a test to students: either remote online tests (hereafter ‘online’) or face-to-face paper-based tests (hereafter ‘face-to-face’). The testing mode effect refers to differences in student performance between tests given in different testing modes. In our study, students experienced the exact same format of test questions and wrote answers on the same hard copy answer sheets in online and face-to-face testing environments. While we do not seek to untangle the potential effects of where students completed their exams, the option to take exams at home (rather than face-to-face) is becoming increasingly common, as online courses surge in popularity. According to the National Center for Education Statistics, in 2018, over one-third of all undergraduate students engaged in distance education, and 13 percent of total undergraduate enrollment exclusively took distance education courses. Of the 2.2 million undergraduate students who exclusively took distance education courses, 1.5 million enrolled in institutions located in the same state in which they lived (Hussar et al., 2020). These values are expected to increase, as online learning opportunities are cost-effective and students entering higher education may have families, part-time or full-time jobs, or other responsibilities. Online exams are also an integral part of our national efforts to promote diversity, equity, and inclusion, as students who request testing accommodations often complete exams remotely, and those exams are often computer-based.
Some previous work defined testing mode slightly differently, as computer-based or paper-based exams taken in the same environment. One study found that if two students with equivalent competencies completed an assessment in the same testing location, the student who took the paper-based test outperformed the student completing the computer-based test. Specifically, Backes and Cowan (2019) examined test scores for hundreds of thousands of K-12 students in Massachusetts and demonstrated a testing mode effect; specifically, they found an online test ‘penalty’ of approximately 0.10 standard deviations in math and 0.25 standard deviations in English. However, other research disputes these results, with mixed outcomes presented in the literature. Meta-analyses of testing mode effects on K-12 mathematics test scores (Wang et al., 2007) and K-12 reading assessment scores (Wang et al., 2008) demonstrated no statistically significant effect of testing mode. While results on testing mode effects are mixed, less work has been conducted to explain these potential differences. As one of few studies performed to answer this question in undergraduate chemistry settings, Prisacari and Danielson (2017) administered practice tests in the form of computer-based or paper-based assessments to 221 students enrolled in general chemistry and found no evidence of testing mode effects between the two groups. Notably, students were assigned a testing mode based on scheduling availability, not testing mode preference, and researchers administered all assessments in the same classroom. Researchers concluded that instructors need not be “concerned about testing mode (computer versus paper) when designing and administering chemistry tests.”
When given the option of testing mode, some students may simply prefer the convenience of taking college-level exams online from their home. When fifth-year medical students were given the opportunity to select a testing mode for an exam, researchers evaluated performance differences, the reasons behind the choice of format, and satisfaction with the choice (Hochlehnert et al., 2011). This study did not observe differences in performance based on testing mode. We hypothesize this may be due to the academic maturity of the fifth-year medical students who participated in the study, but it could alternatively relate to the nature of the exam content. Additionally, students who elected to take online exams described their exams as clearer and more understandable.
In this paper, we use the online option in an organic chemistry course to investigate whether differences in grades are reflective of real differences in student performance or of other extrinsic and intrinsic factors that relate to preference for a testing mode. Specifically, we explore how testing mode preference might be related to constructs associated with motivation and engagement, which have been shown to relate to student performance in chemistry (Garcia, 1993; Black and Deci, 2000; Ferrell et al., 2016). We measured two motivation processes, intrinsic goal orientation and perceived value of chemistry (Pintrich et al., 1993). Intrinsic goal orientation is motivation that stems primarily from internal reasons (e.g., curiosity, wanting a challenge, or to master the content) (Pintrich et al., 1993). Task value, or perceived value of chemistry, is motivation to engage in academic activities because of the students’ beliefs about the utility, interest in, and importance of the disciplinary content (Pintrich, 1999). We selected these two distinct constructs because they reflect both intrinsic motivators (i.e., intrinsic goal orientation), such as the desire to develop deeper understanding, and extrinsic motivators (i.e., task value), such as beliefs that the subject material might be relevant to their future careers. If we find that, for example, students who prefer to take online tests have higher intrinsic goal orientation, then one strategy that might work well with online students is to embrace the intrinsic factors that motivate them by, for example, encouraging instructors to teach through discovery or problem-based learning approaches. However, if we find that these students display higher levels of task value, reflecting that extrinsic factors motivate them more strongly, then we may encourage instructors who teach online students to intentionally contextualize course material in real-world examples (e.g., Fahlman et al., 2015).
Another possible explanation for differences in performance between testing modes is their relation to student engagement, an essential part of the learning process (Coates, 2005; Chi and Wylie, 2014). Attempts to measure engagement in undergraduate STEM classrooms come in many forms, including participation and student behavior in the classroom (Sawada et al., 2002; Pritchard, 2008; Smith et al., 2013; Chi and Wylie, 2014; Eddy et al., 2015; Lane and Harris, 2015; Wiggins et al., 2017; McNeal et al., 2020), students’ reflections of their own cognitive and emotional engagement (Pritchard, 2008; Chi and Wylie, 2014; Wiggins et al., 2017), and even real-time measurements through the use of skin biosensors (McNeal et al., 2020). Similar to other measures in the current study, we quantified engagement because of its potential power in explaining performance disparities, and because of past research in the context of undergraduate STEM displaying its importance in academic success and performance (McNeal et al., 2020; Miltiadous et al., 2020).
We expected one of three outcomes: in one scenario, we do not observe testing mode effects. If we do observe a difference in performance, another scenario is that average exam scores are lower among students who elect a particular testing mode primarily due to self-selection effects, where less academically prepared and engaged students tend to prefer one testing mode. Alternatively, students who are equally prepared and engaged may perform at a lower level due to external factors related to a testing mode. To our knowledge, this is the first exploratory study to identify student preferences for testing modes in an undergraduate chemistry setting and propose explanations for potential differences in performance due to testing modality. We analyzed data from two semesters of an organic chemistry class at a large southeastern university and addressed the following questions: (RQ1) Are there performance gaps between students who elect to take online tests and those who take face-to-face tests? (RQ2) Do these two groups differ with respect to other attributes, such as (a) intrinsic goal orientation, (b) perceived value of chemistry, or (c) incoming academic preparation? How do these attributes relate to performance overall? (RQ3) How does performance differ between students who report equal in-class engagement but selected different testing modes? (RQ4) Why do students prefer one testing mode over the other?
| N = 305 | Online | Face-to-face |
| --- | --- | --- |
| **Participants** | | |
| Exam 1 | n = 143 | n = 126 |
| Exam 2 | n = 143 | n = 127 |
| Exam 3 | n = 161 | n = 108 |
| **Binary gender** | | |
| Women | 70.9% | 65.7% |
| Men | 29.1% | 34.3% |
| **Class standing** | | |
| First year | 86.1% | 79.8% |
| Second year | 5.6% | 11.4% |
| Third year | 5.8% | 5.3% |
| Fourth year | 1.8% | 1.9% |
| Post-baccalaureate | 0.7% | 0.8% |
| Graduate student | 0.0% | 0.8% |
| **Race/ethnicity** | | |
| Asian/Asian American | 3.8% | 3.6% |
| Black/African American | 3.4% | 0.8% |
| Latino/Hispanic American | 4.7% | 4.2% |
| Native Hawaiian or Other Pacific Islander | 0.0% | 0.8% |
| White/European American | 75.5% | 86.7% |
| Other | 12.6% | 3.9% |
| **First-generation student?** | | |
| No | 73.4% | 81.4% |
| Yes | 14.1% | 14.1% |
| Unsure | 12.5% | 4.4% |
The classes in this study met via Zoom three times weekly. Those who took the class in summer met for 75-minute class sessions over 10 weeks, and those who took the class in the fall met for 50-minute class sessions over 15 weeks. Learning Assistants assisted these classes, resulting in a 15:1 student-to-Learning-Assistant ratio. The course design was a flipped classroom model, and in every class session students were randomly assigned to breakout rooms. The instructor uploaded pre-recorded lectures to a Learning Management System, and students attended Zoom classes to work through Process Oriented Guided Inquiry Learning (POGIL) handouts and ask questions. For example, a typical class period would begin with 10–15 orientation slides/drawings/pictures followed by 5–10 minutes of general questions. Afterward, students went into breakout rooms to work on questions from the POGIL handout.
Exams typically covered 3–4 chapters of content and were designed to take about 10 minutes per page. Question types included multiple choice, fill in the blank, and free response. At least 50% of all exam items tested students' ability to use, or interpret, bond line drawings to convey details of organic chemical structures or changes in chemical structure due to reactions. These molecular representations are commonly used to simplify complex molecules. Most of these questions were based on analogous reasoning (e.g., A + B → C, where either A, B, or C was missing). Each exam included an opportunity for extra credit, increasing the highest potential score to 120%. Students decided whether they took the online exam or the face-to-face exam. The exams were identical in content and distributed at the same time (synchronously). Students who took online exams were proctored by the instructor (AC) and graduate teaching assistants via Zoom (i.e., no 3rd-party proctoring service was used). All exams were closed-note, so students were not able to use their notes or outside resources. Students who met face-to-face sat six feet apart (socially distanced) in a classroom.
The course used Gradescope to evaluate the exams. Students in both formats received equivalent question sets and answer keys. Both testing modes received an “Answer Sheet” one to two days before the exam with numbered blank boxes. On exam days, the students filled out the Answer Sheet. The online cohort was given 10–15 minutes after the exam ended to scan and upload their work. Students were familiar with this process and had experience uploading documents prior to examination. Students used various scanning methods (e.g., smart phone apps or desktop scanner). The face-to-face cohort had their exams scanned by the instructor (AC) and uploaded to Gradescope. A grading rubric was created by the instructor. AC assigned graduate teaching assistants to specific questions, and they graded that question for the entire class.
In the fall 2020 semester, we surveyed students to gain a better understanding of their decisions regarding the choice of testing mode, and to quantify other affective traits that may differentiate the two groups. We measured intrinsic goal orientation (Pintrich et al., 1993) and perceived value of chemistry (Pintrich et al., 1993) using a 7-point Likert scale (e.g., strongly agree to strongly disagree; Table S1 in ESI†). We lightly modified the scales and validated them on our student population through confirmatory factor analysis. To measure engagement, we also asked students for the percentage of class time they felt intellectually engaged in learning the material, with options of less than 10%; 10–30%; 31–50%; 51–70%; and over 70%. To collect survey responses, we administered a Qualtrics survey to students in the class during the last week of the semester, and offered a point of extra credit (a small fraction of the total course grade, awarded for clicking on the link to the survey).
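As a minimal illustration of the scoring step described above, the sketch below converts 7-point Likert responses to numeric values and averages them into a subscale score. The response labels and item grouping are illustrative assumptions, not the study's actual instrument, and the original analyses were carried out in R rather than Python.

```python
# Hypothetical sketch: score a 7-point Likert subscale by mapping response
# labels to 1-7 and averaging across one construct's items. Labels and item
# groupings are illustrative, not the study's actual survey instrument.
LIKERT = {
    "strongly disagree": 1, "disagree": 2, "somewhat disagree": 3,
    "neutral": 4, "somewhat agree": 5, "agree": 6, "strongly agree": 7,
}

def subscale_score(responses):
    """Average numeric score across one construct's items."""
    values = [LIKERT[r.lower()] for r in responses]
    return sum(values) / len(values)

# e.g., one student's responses to four intrinsic-goal-orientation items
print(subscale_score(["agree", "strongly agree", "somewhat agree", "agree"]))  # 6.0
```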
| Fit index | What is measured | Explanation | Intrinsic goal orientation (4 items; N = 198) | Value of chemistry (6 items; N = 198) |
| --- | --- | --- | --- | --- |
| χ² | Determines the magnitude of discrepancy between the covariance matrix estimated by the model and the observed covariance matrix of the data | Should be non-significant, meaning the estimated covariates are not significantly different from the actual data covariates | 5.355 (p = 0.06) | 12.238 (p = 0.2) |
| CFI | Determines if the model fits the data by comparing the χ² of the model with the χ² of the null model; adjusts for sample size and number of variables | >0.90 = acceptable; >0.95 = good | 0.964 | 0.993 |
| RMSEA | Determines how well the model fits the data, favoring parsimony and models with fewer parameters | <0.05 to 0.06 = good; 0.06 to 0.08 = acceptable; 0.08 to 0.10 = mediocre; >0.10 = unacceptable | 0.09 | 0.043 |
| SRMR | A standardized square root of the difference between the observed correlation and the predicted correlation | <0.05 = good; 0.05 to 0.08 = acceptable; 0.08 to 0.10 = mediocre; >0.10 = unacceptable | 0.03 | 0.029 |
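The RMSEA values reported for the two confirmatory factor models can be recovered from their χ² statistics given each model's degrees of freedom. Assuming standard one-factor CFAs over a 4-item and a 6-item scale (df = 2 and df = 9 respectively; these degrees of freedom are our inference, not stated in the original), a short sketch reproduces the table's values:

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation:
    sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Assumed one-factor models: 4 items -> df = 4*5/2 - 2*4 = 2;
#                            6 items -> df = 6*7/2 - 2*6 = 9.
print(round(rmsea(5.355, 2, 198), 2))   # 0.09  (intrinsic goal orientation)
print(round(rmsea(12.238, 9, 198), 3))  # 0.043 (perceived value of chemistry)
```

That the formula reproduces both reported RMSEA values under these assumed degrees of freedom is consistent with 4-item and 6-item one-factor measurement models.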
Intrinsic Goal Orientation = β0 + β1 Testing Modality + ε

Perceived Value of Chemistry = β0 + β1 Testing Modality + ε

Incoming Preparation = β0 + β1 Testing Modality + ε

Performance = β0 + β1 Testing Modality + β2 Exam Number + β3 Intrinsic Goal Orientation + β4 Perceived Value of Chemistry + ε
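The study fit these models as repeated-measures mixed-effect models in R (see below); as a simplified illustration of how the single-predictor models work, note that with a binary testing-modality dummy (0 = face-to-face, 1 = online), the ordinary least-squares slope β1 is exactly the difference between group means. A Python sketch with fabricated scores:

```python
# Simplified sketch of the single-predictor models above. With one binary
# dummy predictor, OLS reduces to comparing group means: the intercept is
# the face-to-face mean and the slope is the online-minus-face-to-face gap.
# Scores below are fabricated for illustration only.
def fit_dummy_model(scores, modality):
    """Return (beta0, beta1) for score = beta0 + beta1 * modality."""
    g0 = [s for s, m in zip(scores, modality) if m == 0]
    g1 = [s for s, m in zip(scores, modality) if m == 1]
    beta0 = sum(g0) / len(g0)          # intercept = face-to-face mean
    beta1 = sum(g1) / len(g1) - beta0  # slope = online minus face-to-face
    return beta0, beta1

scores   = [82, 90, 86, 74, 70, 78]
modality = [0,  0,  0,  1,  1,  1]
print(fit_dummy_model(scores, modality))  # (86.0, -12.0)
```

The full performance model additionally includes exam number and the two motivation covariates, plus random effects for student identity; that richer structure requires a mixed-model routine (the study used R's nlme) rather than this two-group reduction.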
Lastly, to explore the effect of engagement on student performance, data were first subset by engagement level (<10%, 10–30%, 31–50%, 51–70%, >70%). Within each engagement category, one linear model tested for differences in student performance due to testing modality. Note that only pairwise analyses were run within each engagement category, not across categories, due to constrained sample sizes. For example, the model for the reported engagement category of <10% would be:

Performance (engagement < 10% subset) = β0 + β1 Testing Modality + ε
All statistical analyses were performed in R version 4.0.3. For quantitative analysis of RQ1–RQ3, we ran repeated measures linear mixed-effect (LME) models using the nlme package (Pinheiro et al., 2020). To account for repeated measures from a single student, we included Student ID as a random effect variable. In measures of performance, we also included incoming preparation as a random effect as it significantly impacted performance outcomes (F(1,173) = 57.823, p < 0.0001).
When appropriate, we used the emmeans package (Lenth, 2019) to obtain post hoc pairwise significance, utilizing Tukey post hoc p-value adjustments. All independent correlational measures are based on Pearson's Correlation Coefficients. Statistical significance was based on p < 0.05 and confidence intervals that exclude zero.
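To make the correlational criterion concrete, the sketch below computes a Pearson correlation coefficient from deviations and a 95% confidence interval via the Fisher z-transform, then checks whether the interval excludes zero. The data are fabricated; the study's actual correlations were computed in R.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
r = pearson_r(x, y)            # 0.8 for this toy data
lo, hi = fisher_ci(r, len(x))  # interval includes zero here: n = 5 is tiny
print(round(r, 2), lo < 0 < hi)
```

With realistic sample sizes (N ≈ 198, as in this study), an r of 0.23–0.38 yields a confidence interval that comfortably excludes zero, matching the significance criterion stated above.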
An individual student response was coded into multiple themes when appropriate; that is, a single student's response may fit multiple thematic codes. We calculated the frequency of response within each theme, separately for each testing modality, by dividing the number of responses in a specific category by the total number of data points gathered for that modality.
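The frequency calculation above can be sketched as follows; the theme names and responses are invented placeholders, and each student may contribute multiple codes.

```python
from collections import Counter

# Hedged sketch of the frequency calculation described above: each student
# response carries one or more thematic codes, and a theme's frequency is
# its code count divided by the total codes for that testing modality.
def theme_frequencies(coded_responses):
    """coded_responses: list of per-student lists of theme codes."""
    counts = Counter(code for codes in coded_responses for code in codes)
    total = sum(counts.values())
    return {theme: n / total for theme, n in counts.items()}

# Invented example: four online testers, five data points in total
online = [["convenience"], ["covid", "convenience"], ["covid"], ["anxiety"]]
print(theme_frequencies(online))
# {'convenience': 0.4, 'covid': 0.4, 'anxiety': 0.2}
```

Repeating the same calculation on the face-to-face responses gives the second modality's frequencies, as described above.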
In the broader literature, the effect of testing mode on student outcomes is mixed, with some studies showing nonsignificant differences between computer-based and paper-based test results (Horkay et al., 2006; Wang et al., 2007, 2008; Tsai and Shin, 2013; Meyer et al., 2016) and other studies showing significant differences (Clariana and Wallace, 2002; Bennett et al., 2008; Keng et al., 2008; Backes and Cowan, 2019). In the following results and discussion, we summarize our findings, place them in the context of previous work, and when possible, make recommendations for future research or instructional practices.
Contrary to our predictions, the only difference between students of the two testing modes, other than test performance outcomes, was their responses to survey questions that gauged chemistry task value. The relationship between perceived value of chemistry and testing mode was statistically significant (F(1,394) = 4.393, p = 0.037), such that students who chose to take the exam in a face-to-face format reported a higher value of chemistry (Fig. 2B). Task value is the perceived value attributed to a task (in this case chemistry) or the reported utility and importance of the disciplinary content. Students' motivation to learn and perform may be dependent upon, in part, the value they attribute to the task, and previous work has demonstrated its predictive relation to performance outcomes (Bong, 2001; Joo et al., 2013; Robinson et al., 2019). One explanation for our results may be that students who chose to take face-to-face exams did so, to some extent, based on how important they perceived organic chemistry, which in turn impacted their performance on assessments.
However, we observed a positive correlation between performance outcomes and these affective measures, as well as incoming academic preparation. Specifically, we found that intrinsic goal orientation (F(1,173) = 43.36, p < 0.001), perceived value of chemistry (F(1,173) = 10.23, p = 0.001), and incoming preparation (F(1,173) = 57.82, p < 0.001) significantly impacted student performance regardless of testing mode. When we ran correlational analyses of each measure independently, we found each measure was positively related to student performance (intrinsic goal orientation: r = 0.28, p < 0.001; perceived value of chemistry: r = 0.23, p < 0.001; incoming preparation: r = 0.38, p < 0.001; Fig. 2A). In other words, these factors correlated with academic performance for all students, regardless of testing mode preference.
In other words, among students who reported being intellectually engaged in learning the material for over 50% of class time, those who chose to take the exams face-to-face performed significantly better than students who reported equivalent engagement levels but took the exam online. We expected student engagement and measures of performance to be closely linked, regardless of testing mode (Coates, 2005). We suggest one of three explanations for our results. In one scenario, despite similar levels of reported engagement, other affective factors inherent to the student lead to underperformance during online exams (such as reported value of the material, as described above, or an unexplored variable). Another possibility may be that online exams directly disadvantage students who are otherwise equally prepared and engaged. Despite the importance of student engagement in evidence-based teaching and learning (National Research Council, 2012), it is critical that assessments reflect the content students have learned. Assessing students in two different ways (computer-based or paper-based) in two different locations (remote or in the classroom) may result in the appearance of lower understanding of material, when in fact online students experience the assessment differently, leading to lower scores despite similar knowledge. Previous researchers pointed out that factors such as screen size, font size, and resolution of graphics have the potential to enhance the experience of taking online assessments (McKee and Levinson, 1990). While our results do not support this previous work, we agree that the experience of taking an online exam is fundamentally different from a face-to-face exam, which may have led to lower grades among our online testers.
A third possibility is that our single-item measure of engagement, in addition to low sample sizes across some engagement categories, is not sufficient to draw conclusions at this stage; yet, we point to the possibility of a relationship between these testing modes and engagement, and hope future work pursues this open question.
Fig. 4 Themes resulting from coded open-ended survey items, including the frequency of response, a description of each theme, and examples extracted from student responses.
We received 199 surveys from students who reported taking most exams face-to-face. After removing 104 surveys due to non-response, we had a total of 95 responses, which we binned into 9 categories, leading to a total of 167 data points. The four most common responses for the face-to-face data (Fig. 4) were: Increased performance and focus (20%), Increased comfort and decreased stress (17%), Preference for classroom environment (17%), and Avoidance of technical issues (16%).
Regardless of whether a student chose the online or face-to-face testing mode, they were likely to mention comfort and convenience as a reason for their preference, whether through decreased anxiety, exam format, or preference for a specific environment. While nearly all of the top responses for the online testing modality related to comfort, with the notable exception of COVID-19 risk factors, students who chose the face-to-face modality mentioned factors of convenience and classroom success. This raises the question: how do students identify what is “convenient, comfortable, and important” within a learning environment? Future work should focus on understanding how students classify convenience and comfort, to illuminate why students choose their preferred testing modalities and to determine whether underlying personality traits inherent to those decisions play a role in the performance disparity.
Many students across both testing modes mentioned stress or anxiety associated with the exams in their open responses. Test anxiety can be characterized by the negative cognitive or emotional reactions to perceived and actual stress from fear of failure (Zeidner, 1998). Test anxiety is pervasive across large foundational STEM courses such as organic chemistry, where exam grades account for the majority of the student's final score, and previous work shows this disproportionately impacts women (Ballen et al., 2017; Salehi et al., 2019b). To our knowledge, research has not addressed test anxiety during online exams, or compared the relative impact of online and face-to-face exams on anxiety, but this is a potential area of future exploration.
Differential performance may also be due to alternative explanations only briefly explored within this study: (1) measurable advantages of the in-person classroom testing environment unnoticed and unreported by students, or (2) unmeasured advantages of ritualistic behaviors exhibited by students traveling to a classroom for in-person examination. Previous research has shown that ritualistic behaviors such as test-taking routines and factors such as the use of professional attire can increase student performance (Adam and Galinsky, 2012). This can be due to the increased perception of professional expectations by students who chose to dress professionally, or through the introduction of a routine specific to test taking. In another example, previous work showed that chewing gum for up to 5 minutes before an examination may increase student performance (Onyper et al., 2011). Future work would profit from exploring the ritualistic act of relocating from a home environment to a classroom environment as a key part of the testing preparation routine, transitioning to a test-taking mindset, and establishing boundaries between a relaxed state and a professional state for increased test performance.
As this study was exploratory in nature, there are additional potential influences on student outcomes. For example, while the instructor has extensive experience designing and teaching chemistry online, the online exam option for organic chemistry was newly offered. By design, both exam formats required students to complete identical answer sheets, and online students then uploaded responses to be graded. Students who took the test in the online format had previous experience uploading activity sheets and were familiar with the process, meaning that neither testing format exposed students to new challenges on exam day. While there were no reportable issues as the online option continued through the semester, and there are no planned adaptations to future implementation, it is possible that future iterations could lead to changes in student outcomes and perceptions following adjustments.
Future research will delve into whether these results were due to the use of a computer or the physical space in which students took exams, and how these experiences led to lower performance and a lower perceived value of chemistry as a field. However, taken together, our results were consistent and clear, and reflect the divergent experiences of students who must decide how they approach assessments. We hope our research can serve as a foundation for future questions that tease apart the impacts of task value and testing mode preference on student performance.
Surprisingly, our data suggested that this relationship was not associated with incoming preparation, student reported engagement levels, or measures of intrinsic goal orientation. Our exploratory results support that a primary difference between test mode preferences is perceived value of chemistry. However, it is also possible that the relationship is due to innate aspects of classroom environment unrealized by students, or the mentality of test taking itself.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1rp00324k |
This journal is © The Royal Society of Chemistry 2022 |