DOI: 10.1039/C9RP00077A
(Paper)
Chem. Educ. Res. Pract., 2019, Advance Article

Vickie M. Williamson*^{a},
Deborah Rush Walker^{b},
Eric Chuu^{c},
Susan Broadway^{d},
Blain Mamiya^{e},
Cynthia B. Powell^{f},
G. Robert Shelton^{g},
Rebecca Weber^{d},
Alan R. Dabney^{c} and
Diana Mason^{d}
^{a}Department of Chemistry, Texas A&M University, 77843-3255, USA. E-mail: williamson@tamu.edu
^{b}Department of Chemistry, The University of Texas at Austin, USA. E-mail: drwalker@cm.utexas.edu
^{c}Department of Statistics, Texas A&M University, TX, USA. E-mail: ericchuu@tamu.edu; adabney@stat.tamu.edu
^{d}Department of Chemistry, University of North Texas, USA. E-mail: Susan.Broadway@unt.edu; Rebecca.Weber@unt.edu; Diana.Mason@unt.edu
^{e}Department of Chemistry and Biochemistry, Texas State University, USA. E-mail: bmm172@txstate.edu
^{f}Department of Chemistry and Biochemistry, Abilene Christian University, USA. E-mail: powellc@acu.edu
^{g}Department of Science and Mathematics, Texas A&M University-San Antonio, USA. E-mail: Bob.Shelton@tamusa.edu

Received 14th March 2019, Accepted 24th June 2019

First published on 25th June 2019

First-semester general chemistry is a known “gatekeeper” course due to its high failure rate. These higher education courses are taken by students who for the most part are regularly admitted freshmen, yet many struggle to succeed. In this investigation researchers from six higher-education institutions of varied sizes with student bodies of different ethnic composition teamed up to investigate the Math-Up Skills Test (MUST) as a potential tool to identify at-risk students in first-semester general chemistry (Chem I). A large population (N = 1073) was given the MUST at the beginning of the semester. The MUST had good internal consistency with the sample (Cronbach's alpha = 0.856). The MUST offers a quick 15 minute assessment of basic mathematics ability without a calculator. Instructors may find it easier than other documented predictors, which may take more time or involve obtaining student records (e.g., Math SAT). Results from the MUST support prior findings that mathematics skills correlate with course grades in chemistry. Poor arithmetic automaticity may be an underlying factor in lower performance by many students. With statistical modeling, the MUST, along with other demographic variables, was able to identify students who would go on to make a 69.5% or better in Chem I with a high percent of accuracy. The MUST, in tandem with other demographic variables, should be considered a potential tool for early identification of students at-risk for failing the class.

Student employment has also been explored, on the idea that a simple lack of time might signal potential difficulties. Researchers remain divided on the impact of student employment on grades. King (2006) found positive effects for those with low work hours at on-campus jobs related to the student's academic interests. Dundes and Marx (2006) suggested that there may be an optimal number of hours that is beneficial, since in their study, students who worked 10–19 hours earned higher grades than those who worked more or fewer hours, including those who did not work. On the other hand, Lammers et al. (2001) found that the level of academic achievement was lower for all working students.

In addition to gender, ethnicity, and employment, there seems to be a link between student success in college and the education level of the parents. For example, Sirin (2005) noted in a meta-review that level of parental education showed significant effect sizes for both student performance and dropout rates. Students for whom neither parent had received a 4 year college degree (first-generation college students) performed more poorly and had higher dropout rates than students with one parent with a 4 year degree. Snibbe and Markus (2005) argued that parental education is a proxy for social class or socioeconomic status (SES), meaning that first-generation students were more likely to come from working class backgrounds and face significant economic barriers in college.

Another line of research has focused on diagnostic tests to signal potential student success or difficulty. In fact interest in early identification of struggling students in general chemistry courses has increased in recent years. Cooper and Pearson (2012) referred to several diagnostic instruments that have been used to identify potential at-risk students. These included the Toledo Chemistry Placement Exam (Hovey and Krohn, 1963), the Group Assessment of Logical Thinking (GALT) (Bunce and Hutchinson, 1993), the Test of Logical Thinking (TOLT) (Tobin and Capie, 1981), the California Chemistry Diagnostic Test (Karpp, 1995), the University of Iowa Placement Exam (Pienta, 2003) and the Math SAT (Spencer, 1996). A common component of these instruments is mathematics proficiency. Both the TOLT and the GALT assess logical-reasoning ability and require mathematics in items on proportions, probability, and correlations. These tests have been linked to achievement in chemistry by a number of researchers (e.g., Williamson et al., 2017). Lewis and Lewis (2007) found that both the TOLT and the SAT could be used to successfully identify at-risk students as have other researchers (Andrews and Andrews, 1979; Mason and Verdel, 2001). Stone et al. (2018) used the ACT at a liberal arts institution to predict success. Pyburn et al. (2013) found that language comprehension, prior knowledge, and mathematics ability, as measured by the Math SAT, significantly contributed to chemistry performance. Remaining predictors of success for students in general chemistry mentioned in the Lewis and Lewis (2007) paper include high school GPA, ACT scores, personality characteristics, and content-specific diagnostic exams. Gerlach et al. (2014) found that scale literacy was the best predictor of success on conceptual final examination scores for general chemistry students, building on the work of others (e.g., Jones and Taylor, 2009). Scale literacy relates to proportion and quantity. 
Jones and Taylor (2009) outlined a trajectory of scale development, with automaticity and accuracy being a characteristic of experienced scale development. Ralph and Lewis (2018) found that while at-risk students, with low Math SAT scores, struggled with all topics in general chemistry, the mole concept and stoichiometry were especially difficult for these students. These concepts are very mathematical in nature.

The idea that chemistry grades correlate with proficiency in mathematics is not new (Weisman, 1981; Spencer, 1996). Basic arithmetic skills and the automaticity of these skills have been investigated by a number of researchers, with some suggesting interventions. For example, Bohning (1982) used an 84-item test, which was mailed to students with instructions not to use a calculator. Students were informed of the results and could enroll in a concurrent, supplemental course that focused on a formal, intensive review of the mathematics concepts involved on the test. A high percentage of those who took the supplemental course passed the chemistry course. Srougi and Miller (2018) studied peer learning in the chemistry laboratory by comparing groups deliberately composed of students with differing mathematical abilities to groups that students self-selected. These authors found that both mathematics abilities and attitudes towards working with others increased for those grouped by differing levels of mathematics ability. The 13-item mathematics instrument was given at the first and last lab and allowed calculator use. Topics included chemistry-oriented mathematics concepts, including basic operations, fractions, exponents, percentages, scientific notation, unit conversions, significant figures, graphing, and algebra. Indeed, even though many mathematics-preparatory modules and activities exist to help instructors bolster their students’ mathematics knowledge (e.g., Dahm and Nelson, 2013), students can still have issues with mathematics.

Johnstone (2000) outlined curricular changes to help students better succeed with chemistry, using his model of the nature of chemistry, which is composed of macroscopic, submicroscopic or particulate, and symbolic levels. It is this symbolic level that includes symbols, formula, equations, and mathematical treatments. Johnstone believed that students should be moved through these levels during instruction, but that introducing all levels simultaneously is a ‘sure recipe for overloading Working Space’ (Johnstone, 2000, p. 11). According to Johnstone, as students become more expert-like, they are able to easily move between the levels. These ideas fit well with the constructivist view of learning. Williamson (2008, p. 68) provided a definition of constructivism, based on the work of Piaget (1977); von Glasersfeld (1995); Osborne and Wittrock (1983); Bodner (1986):

“Constructivism is the belief that:

(a) knowledge is constructed from interactions with people and materials, not transmitted,

(b) prior knowledge impacts learning,

(c) learning, especially initial understanding, is context specific, and

(d) purposeful learning activities are required to facilitate the construction or modification of knowledge structures.”

Students who have constructed knowledge structures concerning basic mathematics should be able to move between these ideas or to have automaticity with basic mathematics skills.

Currently, faculty at many universities are responding to pressures to increase retention in course enrollments and are looking for ways to identify at-risk students (e.g., Hanover Research, 2018). In all, chemistry instructors are well aware that some proficiency in mathematics is an important prerequisite for success in college chemistry. Even with the studies previously discussed, some instructors identify a lack of easy access to instruments appropriate to measure prior mathematics knowledge as problematic. Lewis and Lewis (2007) noted that the TOLT requires a lot of class time and not all instructors have easy access to students' SAT scores. They suggested that the ease of giving the TOLT makes it a good option, especially when the SAT scores are not readily available. Other issues include justifying the time taken from lecture to administer a diagnostic instrument. The California Chemistry Diagnostic Test and the Toledo Chemistry Placement Exam are time consuming, taking about 45 minutes and 55–60 minutes, respectively. One shorter, calculator-free mathematics test, described by Leopold and Edgar (2008), offers good predictability in second-semester introductory chemistry and is multiple-choice. This instrument takes 30 minutes to administer and requires 10 possible responses on the scan form for 17 of the 20 questions, making it a difficult choice for instructors who are limited to standard five-response forms, have limited class time, or are hesitant to give up 30 of the standard 50 minutes in a class period.

Some have suggested that diagnostic tests should be given online or outside of class time to eliminate the drain on classroom time. Bohning (1982), as previously discussed, mailed students a diagnostic that was not multiple-choice, to be completed at home with instructions not to use a calculator. The Bohning study did not use this diagnostic test to predict performance in the course, but to advise students whether they should take a supplemental mathematics course. Some instructors question the validity of tests taken in a non-proctored environment. Williamson et al. (2017) tested the differences between giving reasoning ability and spatial ability instruments in the classroom or as an online, un-proctored assignment. They used the TOLT for reasoning ability and used a battery of standard spatial ability tests, looking at rotation of three-dimensional objects and visuospatial transformation ability. The authors found that on items that asked students to give all possible combinations, students taking the test online failed to supply as many combinations as those in the classroom. These authors concluded that when either the TOLT or spatial ability tests were given online as compared to in the classroom, the scores varied slightly, favoring the in-class versions, but the online versions still had a similar grade prediction value, making online administration a viable option. Regardless, many instructors prefer the reassurance of in-class administration.

In a recent publication, Albaladejo et al. (2018) noted that students can perform basic arithmetic functions with the aid of a calculator better than without on a short, open-response instrument named by the last author as the Math-Up Skills Test (MUST). While this is not surprising, a further finding was that a lack of automaticity in the arithmetic skills needed to succeed may limit the understanding of students entering college chemistry courses and further contribute to underlying difficulties. The Albaladejo et al. paper reported on a 16-item mathematics diagnostic instrument adapted from that used by Hartman and Nelson (2016). The 16-item MUST was used to identify whether the most basic mathematics knowledge, numeracy, is lacking in students enrolled in first-semester general chemistry and whether their numeracy ability relates to their course averages. The 16-item MUST included items on: multiplication, multiplication and division with powers of ten, zeroth power application, changing fraction to decimal notation, rearranging algebraic equation (combined gas law), logarithms, determining the base-10 logarithm functions, square and square root of a number in scientific notation with a negative power, and balancing simple chemical equations. This pilot study showed that the MUST scores earned without the use of a calculator had a greater correlation with students' final course averages in first-semester general chemistry (Chem I) than the MUST scores earned with the use of a calculator (Albaladejo et al., 2018).

After the pilot study, the MUST was expanded to 20 items, and two versions of the 20-item MUST are included in the supplemental material for the Albaladejo et al. (2018) paper and in the Appendix 1, ESI† of this paper. The solutions for the items are in Appendix 2, ESI†. Added items agreed to by a committee of general chemistry instructors included simplification of a complex fraction, division by zero, simplification of a mixed operations fraction, and calculation of fraction-decimal equivalents. These additional items provide greater variation on fractions than the original MUST questions. Thus, the question that motivated the current research is how does the updated version reflect the prior knowledge of students enrolled in Chem I and can this updated version be used to accurately predict grades in first-semester general chemistry.

1. Does a diagnostic arithmetic instrument (the MUST: Math-Up Skills Test) predict students' course averages in first-semester general chemistry (Chem I)?

2. What other demographic variables will improve the predictability of the MUST?

3. To what extent does the MUST predict which students will have satisfactory course averages (69.5% or higher)?

In the fall of 2017, collaborators distributed the 20-item MUST to Chem I students to investigate the predictive power of the MUST on course grade. Each student completed the MUST without a calculator at the beginning of the semester. Demographics were collected from a self-reported survey that included university classification, gender, ethnicity, major, college-degree attainment of parents and grandparents, and employment for each student. Students were invited to participate in the study from 13 sections of Chem I classes, which ranged in size from 30 to 305 students in each section and were spread across the six institutions with 10 instructors. It should be noted that these fall Chem I classes represented students taking the course on-sequence; traditionally Chem II classes are taught in the spring. Of the 1446 students who were invited, the sample was reduced by excluding students who did not sign the IRB-release form at their university or who omitted significant demographic information. A total of 1127 students consented to be in the study, but 54 had missing data. The resulting student sample consisted of 1073 students who completed all study materials and gave their consent. A number of institutions required that the consent form be given late in the semester, some at the final examination, meaning that students who withdrew from the course or chose to skip the final exam could not be included.

The Chem I classes were taught by ten experienced instructors, whose experience teaching Chem I ranged from 1.5 to over 40 years and who all have an interest in chemistry education. Classes at five of the six institutions were taught for three contact hours per week in face-to-face sessions, which resulted in two or three meetings per week. Classes at one institution (school 6) had five hours of contact time per week. The classes used various textbooks and homework systems. Instructors calculated the course grade in different manners but reported the number of total points the student received and the total number of points possible in the course. For the purposes of this study, the course average for each student was calculated by dividing the points earned by points possible, then the score was reported as a percentage. The goal of this work was to investigate how the MUST predicted performance across a broad geographic setting far beyond what happens at a single institution.

Demographics were collected in a self-reported survey that was given with the MUST. These included university classification, gender, ethnicity, major, college-degree attainment of parents and grandparents, and employment for each student. Grandparent college-degree attainment was included, because a number of this generation has significant grandparent influence in their daily lives (Monserud, 2011). These variables were collected since they had been reported in the literature as predictors of success. It should be noted that the goal of predictive research, which seeks to produce the best possible predictive model, differs from the goal of explanatory research, which seeks to explain the phenomena (Pedhazur, 1997, p. 196). Our goal was to seek a predictive model.

• student classification (freshman, sophomore, junior, or senior),

• gender (male, female, or no response),

• race/ethnicity (Asian, Black, Hispanic, White, Mixed, or Other),

• major (STEM, medical, dual major, or other),

• which of the two MUST questionnaire versions was completed (version 78 or 87),

• whether their parents graduated from college (don’t know, no, or yes),

• whether their grandparents graduated from college (don’t know, no, or yes),

• whether they are employed on-campus (no or yes),

• whether they are employed off-campus (no or yes), and

• number of employment hours per week (0, 1–10, 11–19, 20–29, 30–39, or 40+).

We also included fixed effects in both models to account for differences between the participating universities (schools 1–6). While random effects for university differences would also be an option, they complicate the models and the methods used to tune them. Because we are interested in prediction rather than explanation of relationships, we chose fixed effects, which allow accurate predictions without the added complexity of random effects.

The 2017 data set was split to prepare and test the statistical model. This involved randomly selecting 2/3 of the students from each university's cohort to be used for developing, or training, the models, with the remaining 1/3 being held out for testing the models' accuracies (James et al., 2013). The LASSO method (least absolute shrinkage and selection operator) is a regression analysis method that was used to regularize, or smooth or shrink, the models in an effort to find the set of model coefficients that optimize prediction accuracies in balance with predictive effects for subject covariate variables (James et al., 2013). The LASSO shrinks the regression coefficients toward zero, with greater shrinkage applied to the coefficients for predictor variables that do not contribute substantially to the model's predictive accuracy. It is also possible for the LASSO to shrink model coefficients all the way to zero, meaning that the LASSO can also be viewed as a variable selection tool. The use of the LASSO statistical method automatically filters out variables that are not needed for making predictions.
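The split and the shrinkage idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' R/glmnet code: the roster, the field names, the seed, and the penalty value are all hypothetical. The `soft_threshold` function is the single-coefficient update that underlies coordinate-descent LASSO solvers, shown here only to illustrate how shrinkage can set a small coefficient exactly to zero.

```python
import random
from collections import defaultdict


def stratified_split(students, train_fraction=2 / 3, seed=0):
    """Randomly assign about train_fraction of each school's cohort to training."""
    rng = random.Random(seed)
    by_school = defaultdict(list)
    for student in students:
        by_school[student["school"]].append(student)

    train, test = [], []
    for school in sorted(by_school):
        cohort = by_school[school][:]
        rng.shuffle(cohort)
        n_train = round(len(cohort) * train_fraction)
        train.extend(cohort[:n_train])
        test.extend(cohort[n_train:])
    return train, test


def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero; values within the penalty become 0."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical toy roster: 30 students at each of three schools.
roster = [{"id": i, "school": 1 + i % 3} for i in range(90)]
train, test = stratified_split(roster)
print(len(train), len(test))

# Hypothetical unpenalized coefficients and penalty; small ones are dropped.
print([soft_threshold(b, 0.5) for b in (1.07, -0.3, 0.1, -11.2)])
```

Splitting within each school, rather than across the pooled sample, keeps every institution represented in both the training and the test portions.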

To guide the tuning and evaluation of the predictive models for course average, we used mean squared error (MSE). MSE is an average of the squared values of the differences between true course averages and those predicted by the model. As such, the MSE will always be non-negative, and models with smaller MSEs (higher predictive accuracy) can be selected over those with larger MSEs (Sheather, 2009). To guide the tuning and evaluation of the predictive model for course success, we used balanced accuracy. Balanced accuracy is the average of sensitivity (the proportion of success cases that were correctly predicted to be successes) and specificity (the proportion of true fail cases that were correctly predicted to be failures) so that balanced accuracy is a number between 0 and 1 for which numbers closer to 1 indicate greater predictive accuracy (James et al., 2013). By averaging sensitivity and specificity, we can be confident that selected models do relatively well in making predictions of course success both for students who do indeed succeed and students who do not. All modeling was done with the R statistical programming language and the R package glmnet (Friedman et al., 2010). Our R code and data are available upon request and can be used to reproduce the results. We will also give an option for others to use our code with their data at the end of the results section.
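The two metrics defined above can be written out directly. The sketch below is illustrative only: the five course averages and predictions are invented toy values, and success is coded with the 69.5% threshold used in this study.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def balanced_accuracy(actual, predicted):
    """Mean of sensitivity and specificity for 1 (success) / 0 (fail) labels."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    true_neg = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    sensitivity = true_pos / n_pos  # successes correctly predicted
    specificity = true_neg / n_neg  # failures correctly predicted
    return (sensitivity + specificity) / 2


# Hypothetical course averages (percent) and model predictions.
actual_avg = [82.0, 65.0, 74.0, 58.0, 91.0]
predicted_avg = [78.0, 70.0, 71.0, 62.0, 88.0]

actual_success = [int(a >= 69.5) for a in actual_avg]
predicted_success = [int(p >= 69.5) for p in predicted_avg]

mse = mean_squared_error(actual_avg, predicted_avg)
bal_acc = balanced_accuracy(actual_success, predicted_success)
overall_acc = sum(
    a == p for a, p in zip(actual_success, predicted_success)
) / len(actual_success)
print(mse, bal_acc, overall_acc)
```

In this toy case the overall accuracy (0.8) exceeds the balanced accuracy (0.75) because the one misclassified student belongs to the smaller, unsuccessful group, which is the situation balanced accuracy is meant to expose.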

Items on the open-response MUSTs were scored on a binary scale as either correct (1) or incorrect (0). Each student's MUST score was a sum of the correct items on a scale 0–20. Descriptive results gave a mean, standard deviation (SD), and standard error (SE) of the MUST score for the total group of students as 10.36 (SD = 4.93) (SE = 0.15). The average for all students represents a score of 51.8% on the MUST. A t-test between participant scores on the two versions of the MUST indicated no significant difference between versions (p = 0.794). This similarity was expected because the form of the question did not change between the two versions, only the exact number changed (e.g., evaluating 42° versus 47°). The 20-item MUST gave an internal consistency reliability of Cronbach's alpha = 0.856 with our population, which was above the 0.70 satisfactory level (Nunnally and Bernstein, 1994). Fig. 1 shows the frequency of MUST scores with our sample of 1073.

The results for each MUST question by institution were plotted (Fig. 2). Some questions were expectedly more difficult for the students as a whole. Looking beyond the individual means, similar patterns of success and difficulty were observed. Questions that were difficult for students at one institution were also difficult for students at all institutions even though some institutions consistently outperformed others. We investigate how the MUST and its questions perform across the six institutions in the statistical modeling section.

Success in a course typically spans grades of A, B, or C, whereas unsuccessful grades include D, F, or W (withdrawal). The course average for each student was used as the dependent variable. The course average was calculated by dividing the course points earned by each student by the total possible points, then multiplying by 100. Students with course averages of 69.5% or higher were considered successful students, while those with course averages of 69.4% or lower were considered unsuccessful. Table 1 shows these results. As expected, successful students had higher MUST scores. The MUST scores are statistically different (p < 0.05 with a 2-tailed t-test), with successful students outperforming those who were not successful. The effect size, using Hedges' g for unequal groups, was 1.08. This is a large effect size, since according to Cohen (1992) <0.2 is negligible, 0.2 to <0.5 is small, 0.5 to <0.8 is medium, and 0.8 and above is large. A smaller SE for a group (SE = SD/(n^{1/2})) indicates a more homogeneous group or a larger sample size. For example, successful students comprised 80.2% of the total group, so their SE was lower.

Chem I | n = 1073 (%) | MUST (SD) (SE) |
---|---|---|
Successful (69.5% or above) | 861 (80.2%) | 11.33 (4.64) (0.16)^{a} |
Unsuccessful (69.4% or below) | 212 (19.8%) | 6.43 (4.05) (0.28) |

^{a} Successful students significantly higher at p < 0.05 level.
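The standard errors and the effect size reported above can be checked from the summary statistics alone. The sketch below assumes the common form of Hedges' g (pooled standard deviation with the small-sample correction factor); the text does not state which variant was used, so this is a consistency check rather than a reproduction of the authors' calculation.

```python
import math


def standard_error(sd, n):
    """SE = SD / sqrt(n)."""
    return sd / math.sqrt(n)


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with pooled SD and small-sample correction.

    Assumed variant: g = d * (1 - 3 / (4 * df - 1)), with df = n1 + n2 - 2.
    """
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (mean1 - mean2) / pooled_sd
    return d * (1 - 3 / (4 * df - 1))


# Summary statistics taken from Table 1 and the descriptive results.
se_successful = standard_error(4.64, 861)
se_unsuccessful = standard_error(4.05, 212)
se_total = standard_error(4.93, 1073)
g = hedges_g(11.33, 4.64, 861, 6.43, 4.05, 212)

print(round(se_successful, 2), round(se_unsuccessful, 2), round(se_total, 2))
print(round(g, 2))
```

Under these assumptions the rounded values agree with the reported SEs of 0.16, 0.28, and 0.15 and with the reported effect size of 1.08.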

Further exploring relationships between the MUST score and course performance, the data were divided into quartiles for the MUST scores. The MUST quartiles were from 0–7, 8–11, 12–14, and 15–20 for quartiles 1–4; these have populations of 334, 275, 211, and 253, respectively. The quartiles were compared to the course grade with an alluvial diagram in Fig. 3. An alluvial diagram requires two categorical variables. The student's course average was converted into a letter grade using 69.5 to 79.4 as a C, etc. In Fig. 3, we used the student's MUST quartile versus the student's calculated letter grade. Here, students with MUST scores in the top quartile made mainly A's and B's, while those in the bottom quartile mainly made low grades. It should be noted that 30 students from the bottom quartile did make an A; this was about 9% of that quartile (30 of 334) and less than 3% of the full sample. Although the alluvial diagram gives an interesting visual of the relationship between the MUST and course performance, it does not by itself predict course averages. We first tried a few ineffectual models that are included in Appendix 4, ESI†. One was a linear regression from a graph of each MUST score versus the mean course average for that score, but this overfit the data by reducing the variation to 21 data points for MUST scores 0–20. Another was a linear regression of MUST score versus course average for all 1073 points, but this showed that a simple linear relationship between the two did not accurately predict course averages. To meet our goal of a better prediction model, we turned to formal statistical modeling.
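The two categorical variables behind an alluvial diagram can be derived as follows. This is a sketch under stated assumptions: only the C band (69.5 to 79.4) is given in the text, so the A, B, and D cutoffs below are assumed by analogy, and the five students are hypothetical.

```python
from collections import Counter


def letter_grade(course_average):
    """Convert a percentage to a letter grade.

    Only the C band (69.5 to 79.4) comes from the text; the other
    cutoffs are assumptions that follow the same 10-point pattern.
    """
    for cutoff, grade in ((89.5, "A"), (79.5, "B"), (69.5, "C"), (59.5, "D")):
        if course_average >= cutoff:
            return grade
    return "F"


def must_quartile(must_score):
    """Quartile bins reported in the text: 0-7, 8-11, 12-14, 15-20."""
    if must_score <= 7:
        return 1
    if must_score <= 11:
        return 2
    if must_score <= 14:
        return 3
    return 4


# Hypothetical (MUST score, course average) pairs.
students = [(5, 62.0), (9, 74.3), (13, 81.0), (18, 93.5), (6, 90.1)]

# Each (quartile, grade) count is the width of one band in the diagram.
flows = Counter((must_quartile(m), letter_grade(avg)) for m, avg in students)
print(sorted(flows.items()))
```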

Two regression models were fitted to the 20-item MUST with the LASSO method described in the methodology using a random 2/3 of the data from each university. The first model is a linear regression for the Chem I course average. All the predictor demographic variables are categorical, in contrast to the numeric MUST score. With the regression model of the numeric Chem I course average, the estimated model coefficients for different levels of categorical variables are mean differences relative to the variables' baseline levels, holding all other variables constant. In the linear regression model, the estimated coefficient for the numeric MUST variable is the mean change in course average associated with a one-point increase on the MUST, again holding all other variables constant.

The second model was a logistic regression for being successful (69.5 or better average) versus unsuccessful in the course. All the predictor demographic variables are again categorical, in contrast to the numeric MUST score. With the logistic regression model of binary course success (course average at least 69.5) or not, a similar interpretation of coefficients applies, but with mean differences replaced by log odds ratios.

The linear regression and logistic regression models were developed using a random 2/3 of the data from each university. The predictive accuracy of these models was assessed by using them to make predictions on the remaining 1/3 of the data and comparing each prediction to the student's actual course average. The predictions resulted in an estimated mean squared error (MSE) of 110 for the linear regression model. While the MSE estimate is difficult to contextualize on its own, we note that the MSE of a “null” model that has only an intercept and the MUST, but no other covariate terms, was 182.8. Note that a lower MSE is better. Thus, using our predictor variables along with the MUST corresponds to a substantial improvement in model fit (p-value for comparing our model to the null model was approximately 0). Thus, the MUST is an effective feature that, together with the demographic predictor variables, can be used to predict course averages in Chem I.
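One way to put the two MSE values on a more familiar scale is to take square roots, which expresses the typical prediction error in percentage points of course average. The short calculation below uses only the two MSE values reported above; the root-mean-squared-error framing is an added illustration, not part of the original analysis.

```python
import math

mse_full = 110.0   # LASSO linear model with MUST and demographic variables
mse_null = 182.8   # "null" model with intercept and MUST only

rmse_full = math.sqrt(mse_full)
rmse_null = math.sqrt(mse_null)
relative_reduction = 1 - mse_full / mse_null

print(round(rmse_full, 1), round(rmse_null, 1), round(relative_reduction, 3))
```

On this reading, the fuller model's predictions are off by roughly 10.5 percentage points on average in the root-mean-square sense, compared with roughly 13.5 for the null model, which is about a 40% reduction in MSE.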

For determining success and failure in the course, the logistic regression model with LASSO achieved an estimated balanced accuracy of 72.4% (the average of the correct predictions of both the successful and unsuccessful students) and an overall classification accuracy of 78% (meaning that 78% of all predictions among the test data were correct). For many, it is more important to predict success versus failure than the course average. Thus, the MUST with the demographic variables can successfully be used to predict success/failure in Chem I.

In order to see if the 20-item MUST was superior to the 16-item MUST for prediction of course average, the linear regression for MUST score versus course average was run for both forms of the MUST. The new questions were numbers 6, 16, 17, and 18 of the 20-item MUST, so running the analysis without those questions closely replicates the 16-item MUST. We found an MSE = 112 for the model generated using the old 16-item MUST versus MSE = 110 for the model using the 20-item MUST. This indicates that the 20-item MUST model is better, since it had the smaller average of the squared differences between the predicted and actual course averages for the 1/3 of our sample that was not used to develop the models. Testing for significance of the nested models gave a p-value of 0.0226, indicating that the four additional questions in the 20-question MUST jointly account for enough variance that we can reject the null hypothesis that the coefficients for the four additional questions are equal to zero. Thus, the 20-item MUST was preferred for predicting course averages. Comparing the two forms of the MUST on the logistic regression of binary course success or not, the two forms gave comparable accuracy, with no preference for either form. Since we wanted to be able to both predict course average and course success, all further analyses were done with the 20-item MUST.

The different categories of questions on the MUST were explored to see whether certain types of questions were more or less important to our models. Questions on the 20-item MUST were grouped into five question categories: multiplication (questions 1, 2, and 3), division (questions 4, 6, 7, 8, and 16), fraction simplification (questions 9, 10, 17 and 18), logarithms and exponents (questions 5, 12, 13, 14, 15), and symbolic manipulation (questions 11, 19, and 20). For both the linear and logistic models, all categories were needed. The previous analyses indicated that all items of the 20-item MUST should be included in the models.
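The grouping of items into categories can be represented as a simple mapping from category to question numbers, from which per-category subscores follow. The question numbers below are those listed above; the function name and the example response pattern are hypothetical.

```python
CATEGORIES = {
    "multiplication": [1, 2, 3],
    "division": [4, 6, 7, 8, 16],
    "fraction simplification": [9, 10, 17, 18],
    "logarithms and exponents": [5, 12, 13, 14, 15],
    "symbolic manipulation": [11, 19, 20],
}


def category_scores(item_scores):
    """Sum the 0/1 item scores within each question category."""
    return {
        name: sum(item_scores[question] for question in questions)
        for name, questions in CATEGORIES.items()
    }


# Sanity check: every one of the 20 questions belongs to exactly one category.
all_questions = sorted(q for qs in CATEGORIES.values() for q in qs)
print(all_questions == list(range(1, 21)))

# Hypothetical student who answered only the odd-numbered items correctly.
responses = {q: q % 2 for q in range(1, 21)}
subscores = category_scores(responses)
print(subscores, sum(subscores.values()))
```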

The coefficients of our final linear regression and logistic regression models are shown in Appendices 5–8, ESI.† The p-values for the other variables can be compared to that for MUST in Appendices 5 and 7, ESI,† showing that MUST is by far the most important contributor to the ability to predict course average. While not statistically significant, a number of variables still contributed to prediction accuracy. A predictive model can be improved by adding a variable even if that variable's contribution is not statistically significant. For completeness, the coefficients for the linear regression model and logistic regression models are included in Appendices 5 and 7, ESI,† respectively. However, for the examples below and in the discussion of the results, we use the coefficients obtained from the LASSO regression models for course average and classification of success/failure that are found in Appendices 6 and 8, ESI.† The models feature the students’ institution, year in school, major, gender, ethnicity, parents’ graduation, grandparents’ graduation, employment status (on/off campus), hours worked, MUST exam version, and MUST score. Most of the predictor variables are categorical variables, with numeric values for the MUST. The model coefficients can be interpreted as, with all other predictor variables held constant, differences in average course grade (for the linear regression model) or log odds (for the logistic model). The coefficients for the categorical predictor variables compare to their baseline levels, while those for the numeric variables compare two individuals that differ by one unit. For each categorical variable with k categories, k − 1 coefficients will be reported, with the missing category accounted for in the intercept. For example, the freshman year does not have an estimate or coefficient in either Appendices 6 or 8, ESI,† meaning that the contribution from a student being in his/her freshman year is accounted for in the intercept.
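The k − 1 coding of a categorical variable described above can be illustrated with student classification. In this sketch the helper name is hypothetical, and the first listed level is treated as the baseline that is absorbed into the intercept, as the text describes for freshmen.

```python
def dummy_code(value, levels):
    """Return k - 1 indicator columns; the first level is the baseline."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    baseline, *others = levels
    return {level: int(value == level) for level in others}


CLASSIFICATION = ["freshman", "sophomore", "junior", "senior"]

freshman_row = dummy_code("freshman", CLASSIFICATION)
senior_row = dummy_code("senior", CLASSIFICATION)

# A freshman has all indicators at 0, so only the intercept applies.
print(freshman_row)
# A senior switches on exactly one indicator, adding the "senior" coefficient.
print(senior_row)
```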

As one example, consider a senior, white, female student at school 4 in a medical-related major, who worked 40 hours a week on campus, had both parents and grandparents with college degrees, and scored 10 out of 20 on version 87 of the MUST. Using the coefficients from the LASSO regression model in Appendix 6, ESI,† we can compute the following predicted course average for such a student:

Intercept | Senior | Female | Major med | White | Work 40 h | MUST score × coeff. | Predicted average |
---|---|---|---|---|---|---|---|
69.3116 | +3.0867 | +0.2625 | +1.9629 | +0.0687 | −11.1646 | +(10 × 1.0733) | = 74.2608 |

The complete set of coefficient estimates can be found in Appendix 6, ESI.† In general, one can make a prediction based on our model by plugging in values for all the demographic variables in our model. Although the MSE obtained from the LASSO model does not differ substantially from the regression model with all of the features, the simplified LASSO model provides easier interpretation and allows for identification of features that are more important due to the built-in variable selection.
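The arithmetic of the worked example can be reproduced in a few lines. This is a minimal sketch that uses only the coefficient values quoted in the example; the dictionary keys and the function name `predicted_average` are illustrative labels, and a prediction for a different student would need the full set of estimates from Appendix 6, ESI.†

```python
# Terms that apply to the example student (senior, female, medical-related
# major, white, working 40 h), with the values quoted in the text.
LINEAR_TERMS = {
    "intercept": 69.3116,
    "senior": 3.0867,
    "female": 0.2625,
    "major_med": 1.9629,
    "white": 0.0687,
    "work_40h": -11.1646,
}
MUST_COEFF = 1.0733  # change in predicted average per MUST point


def predicted_average(must_score, terms=LINEAR_TERMS, must_coeff=MUST_COEFF):
    """Sum the applicable terms and add the MUST contribution."""
    return sum(terms.values()) + must_score * must_coeff


prediction = predicted_average(10)
print(round(prediction, 4))  # 74.2608
```

Because the MUST coefficient is 1.0733, each additional correct MUST item raises the predicted course average for this student by about one percentage point, all else held constant.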

Similarly, the classification model can be used to compute an estimated log odds of course success. The complete set of coefficient estimates from the LASSO model can be found in Appendix 8, ESI.† For our sample student, the classification model would give a predicted log odds of success equal to −0.6997, which was computed using:

Intercept | School 4 | Senior | Female | White | Major med | Grands graduates | Job on campus | Work 40 h | MUST score × coeff. | Predicted log odds |
---|---|---|---|---|---|---|---|---|---|---|
−0.2423 | −0.4305 | +0.5858 | +0.0803 | +0.0914 | +0.2466 | +0.2993 | −0.0295 | −3.0628 | +(10 × 0.1762) | = −0.6997 |

This corresponds to an estimated probability of exp(−0.6997)/(1 + exp(−0.6997)) = 0.3319 that the student will succeed in the course. The MUST coefficients in both models differ from 0 with high statistical significance, much more so than any of the coefficients for the categorical variables, indicating that the MUST score greatly influences predictions.
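The conversion from log odds to probability can be checked with a short sketch. It uses only the coefficient values quoted for the sample student and the formula given in the text; the names `LOGIT_TERMS`, `log_odds`, and `probability_of_success` are illustrative and are not taken from the authors' code.

```python
import math

# Terms that apply to the sample student, with the values quoted in the text.
LOGIT_TERMS = {
    "intercept": -0.2423,
    "school_4": -0.4305,
    "senior": 0.5858,
    "female": 0.0803,
    "white": 0.0914,
    "major_med": 0.2466,
    "grands_graduates": 0.2993,
    "job_on_campus": -0.0295,
    "work_40h": -3.0628,
}
MUST_LOGIT_COEFF = 0.1762  # change in log odds per MUST point


def log_odds(must_score, terms=LOGIT_TERMS, must_coeff=MUST_LOGIT_COEFF):
    """Sum the applicable terms and add the MUST contribution."""
    return sum(terms.values()) + must_score * must_coeff


def probability_of_success(logit):
    """Convert log odds to a probability: exp(x) / (1 + exp(x))."""
    return math.exp(logit) / (1.0 + math.exp(logit))


logit = log_odds(10)
prob = probability_of_success(logit)
print(round(logit, 4))  # -0.6997
print(round(prob, 4))   # 0.3319
```

A log odds of 0 corresponds to a probability of 0.5, so any negative value, as here, means the model considers success less likely than not.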

School | Description | Undergraduate enrollment F2017 | Admitted student SAT Math F2016 |
---|---|---|---|
1 | Private | 4427 | 515–600 |
2 | Public, Hispanic >25% | 6470 | 380–490 |
3 | R1 | 62915 | 550–670 |
4 | R2, Hispanic >25% | 38694 | 470–560 |
5 | R1 | 38094 | 490–600 |
6 | R1 | 51427 | 580–730 |

While the previous paragraphs show readers how to plug their data into our fitted models to make predictions for their students, we recommend that interested readers fit our linear and logistic regression models to their own institution's data and use those institution-specific models for prediction. The models presented in this paper are limited in that they use data from only six universities, which are not expected to be sufficiently representative of universities worldwide. Our fitted models can be used to make predictions for students at a university that was not part of our collaboration, but fitting the models to one's own university's student data would be expected to yield more relevant predictions. Interested readers can download our code, replace the data file used for model fitting with an analogous data file that has exactly the same variable names and types as ours, and then run the code to obtain custom predictions for their own students. Our code is hosted online as an R script file, with a short PDF document describing how to put in one's own data. Both can be found at this GitHub address: https://github.com/echuu/ChemEd.

We believe that predicting performance is worthwhile; however, our model did not investigate causation, only relationships or associations. Some researchers have found that concurrent courses (e.g., Bohning, 1982), requiring prerequisite courses (e.g., Stone et al., 2018), or special programs like STEM learning communities (e.g., Graham et al., 2013) have helped improve course grades for those at-risk. Our goal was to find an easy way to predict grades or success in general chemistry I.

Compared to other measures of mathematics proficiency, the MUST offers several benefits. It requires only 15 minutes of class time, whereas the Toledo Chemistry Placement Exam, the TOLT, the GALT, and the California Chemistry Diagnostic Test each require 40 or more minutes. Further, the MUST may be the only option for instructors unable to access their students’ Math SAT scores. The researchers in this study were not approved to obtain SAT or ACT scores for our students. Additionally, a number of test-optional universities in the United States now admit students without these standardized examinations.

We propose that the 15 minute MUST be considered for use at the beginning of a Chem I class, because it is easy to obtain, takes little class time, and is free of charge, unlike other instruments discussed in the background section. Additionally, the MUST is free-response, so it lacks the cues and limitations of multiple-choice diagnostics. Scoring of the open-response MUST might be considered a drawback; however, the required responses are specific and can be quickly scored as either correct or incorrect (no partial credit). The MUST successfully predicted the scores of on-sequence, first-semester general chemistry students across multiple institutions. A multiple-institution study has advantages over a single-institution study. The institutions differ in their student populations, yet students bring very similar automaticity skills to the classroom, and students with better arithmetic automaticity consistently outperform those without these foundational skills.

The limitations of the study include the fact that all institutions were from the same state with the same standards for high school graduation; however, the six institutions and student populations varied in size, ethnicity, and SAT Math (Table 2), which helps to overcome this and to give a broader picture. The study looked at students in first-semester general chemistry who were taking the course in the fall semester. The results for these on-sequence students may vary from those who take the course off-sequence, and differences in results may also exist for second-semester students. Finally, it should be noted that a 78% prediction accuracy for success means that 22% of students will receive an incorrect prediction.

Possible research questions for others to investigate could include: Would an intervention prior to beginning a Chem I course significantly improve students' number sense as measured by the MUST, and would this improve course success? Could a concurrent supplemental mathematics course, similar to that used by Bohning (1982), help students with low MUST scores? Can dormant skills be reawakened by supplemental interventions? What factors lead to the lack of mathematics automaticity? The MUST should also be investigated in combination with other factors such as language comprehension (Pyburn et al., 2013), scale literacy (Gerlach et al., 2014), and quantitative reasoning and literacy. Research has shown that attitude also plays a significant role in predicting success in chemistry (Xu and Lewis, 2011); therefore, the combination of the MUST with attitudes and other noncognitive factors should also be investigated. How does calculator usage hinder mathematics automaticity? Does the MUST predict success for the population of off-sequence students who take first-semester general chemistry in the spring semester? Does the MUST predict success in on-sequence, second-semester general chemistry? Would grouping students by differing mathematical ability help in a general chemistry lecture class, as it did in chemistry laboratory, as reported by Srougi and Miller (2018)? With its ease of use and predictive ability, the Math-Up Skills Test, or MUST, has a number of possible applications.

- Albaladejo J. D. P., Broadway S., Mamiya B., Petros A., Powell C. B., Shelton G. R., Walker D. R., Weber R., Williamson V. M. and Mason D., (2018), ConfChem conference on mathematics in undergraduate chemistry instruction: MUST-know pilot study—math preparation study from Texas, J. Chem. Educ., 95(8), 1428–1429.
- Ambroise C. and McLachlan G. J., (2002), Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Natl. Acad. Sci. U. S. A., 99, 6562–6566.
- Andrews M. H. and Andrews L., (1979), First-year chemistry grades and SAT Math scores, J. Chem. Educ., 56(4), 231–232.
- Bodner G. M., (1986), Constructivism: a theory of knowledge, J. Chem. Educ., 63(10), 873–878.
- Bohning J. J., (1982), Remedial mathematics for the introductory chemistry course: the “CHEM 99” concept, J. Chem. Educ., 59(3), 207–208.
- Bunce D. M. and Hutchinson K. D., (1993), The use of the GALT (group assessment of logical thinking) as a predictor of academic success in college chemistry, J. Chem. Educ., 70(3), 183–187.
- Cohen J., (1992), A power primer, Psychol. Bull., 112, 155–159.
- Cooper C. I. and Pearson P. T., (2012), A genetically optimized predictive system for success in general chemistry using a diagnostic algebra test, J. Sci. Educ. Technol., 21(1), 197–205.
- Dahm D. J. and Nelson E. A., (2013), Calculations in chemistry: an introduction, 2nd edn, New York, NY: W.W. Norton & Company, Inc.
- Deam M. T., (1923), Diagnostic algebra tests and remedial measures, Sch. Rev., 31(5), 376–379.
- Dundes L. and Marx J., (2006), Balancing work and academics in college: Why do students working 10 to 19 hours per week excel? J. Coll. Stud. Ret., 8, 107–120.
- Friedman J., Hastie T. and Tibshirani R., (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent, J. Stat. Softw., 33(1), 1–22, http://www.jstatsoft.org/v33/i01/, accessed June 2019.
- Gerlach K., Trate J., Blecking A., Geissinger P. and Murphy K. L., (2014), Valid and reliable assessments to measure scale literacy of students in introductory college chemistry courses, J. Chem. Educ., 91, 1538–1545.
- Graham M. J., Frederick J., Byars-Winston A., Hunter A. B. and Handelsman J., (2013), Increasing persistence of college students in STEM, Science, 341(27), 1455–1456.
- Hartman J. R. and Nelson E. A., (2016), Automaticity in computation and student success in introductory physical science courses [online], available at: https://arxiv.org/abs/1608.05006, accessed June 2019, Quiz available at: http://bit.ly/1HyamPc, accessed June 2019.
- Hanover Research, (2018), Best practices in first year student support, Arlington, VA: Hanover Research.
- Hovey N. W. and Krohn A., (1963), An evaluation of the Toledo chemistry placement examination, J. Chem. Educ., 40(7), 370–372.
- James G., Witten D., Hastie T. and Tibshirani R., (2013), An Introduction to Statistical Learning: with Applications in R, New York: Springer.
- Johnstone A. H., (2000), Teaching of chemistry-logical or psychological? Chem. Educ. Res. Pract., 1(1), 9–15.
- Jones M. G. and Taylor A. R., (2009), Developing a sense of scale: Looking backward, J. Res. Sci. Teach., 46, 460–475.
- Karpp E., (1995), Validating the California Chemistry Diagnostic Test for local use (Paths to success, Volume III), Glendale Community Coll., CA: Planning and Research Office.
- King J. E., (2006), Working their way through college: Student employment and its impact on the college experience, ACE Issue Brief, American Council on Education Center for Policy Analysis.
- Lammers W. J., Onwuegbuzie A. J. and Slate J. R., (2001), Academic success as a function of the gender, class, age, study habits, and employment of college students, Res. Sch., 8(2), 71–81.
- Leopold E. G. and Edgar B., (2008), Degree of mathematics fluency and success in second-semester introductory chemistry, J. Chem. Educ., 85(5), 724–731.
- Lewis S. E. and Lewis J. E., (2007), Predicting at-risk students in general chemistry: comparing formal thought to a general achievement measure, Chem. Educ. Res. Pract., 8(1), 32–51.
- Mason D., (2015), Knowledge decay and content retention of students in first-semester general chemistry, LUMAT – Res. Pract. Math, Sci. Tech. Educ., 3(3), 341–352.
- Mason D. and Mittag K. C., (2001), Evaluating success of Hispanic-surname students in first-semester general chemistry, J. Chem. Educ., 78(2), 256–259. [Correction: J. Chem. Educ., 78(12), 1597.].
- Mason D. and Verdel E., (2001), Gateway to success for at-risk students in a large-group introductory chemistry class, J. Chem. Educ., 78(2), 252–255.
- Mittag K. C. and Mason D. S., (1999), Cultural factors in science education: variables affecting achievement, J. Coll. Sci. Teach., 28(5), 307–310.
- Monserud M. A. and Elder G. H., (2011), Household structure and children's educational attainment: a perspective on coresidence with grandparents, J. Marr. Fam., 73, 981–1000.
- Nunnally J. C. and Bernstein I. H., (1994), Psychometric theory, 3rd edn, New York, NY: McGraw-Hill, Inc.
- Osborne R. J. and Wittrock M. C., (1983), Learning science: a generative process, Sci. Educ., 67, 489–508.
- Pedhazur E. J., (1997), Multiple Regression in Behavioral Research, 3rd edn, Fort Worth, TX: Wadsworth/Thomson Learning.
- Piaget J., (1977), The Development of Thought: Equilibrium of Cognitive Structures, NY: Viking.
- Pienta N. J., (2003), A placement examination and mathematics tutorial for general chemistry, J. Chem. Educ., 80(11), 1244–1246.
- Pyburn D. T., Pazicni S., Benassi V. A. and Tappin E. E., (2013), Assessing the relation between language comprehension and performance in general chemistry, Chem. Educ. Res. Pract., 14, 524–541.
- Ralph V. R. and Lewis S. E., (2018), Chemistry topics posing incommensurate difficulty to students with low math aptitude scores, Chem. Educ. Res. Pract., 19, 867–884.
- Rowe M. B., (1983), Getting chemistry off the killer course list, J. Chem. Educ., 60(11), 954–956.
- Scofield M. B., (1927), An experiment in predicting performance in general chemistry, J. Chem. Educ., 4(9), 1168–1175.
- Sheather S. J., (2009), A Modern Approach to Regression With R, New York: Springer Science + Business Media.
- Shell D. F., Brooks D. W., Trainin G., Wilson K. M., Kauffman D. F. and Herr L. M., (2010), The unified learning model, Dordrecht, Netherlands: Springer.
- Shibley I. A., Milakofsky L., Bender D. S. and Patterson H. O., (2003), College chemistry and Piaget: An analysis of gender difference, cognitive abilities, and achievement measures seventeen years apart, J. Chem. Educ., 80(5), 569–573.
- Sirin S. R., (2005), Socioeconomic status and academic achievement: A meta-analytic review of research, Rev. Educ. Res., 75, 417–453 DOI:10.3102/00346543075003417.
- Snibbe A. D. and Markus H. R., (2005), You can’t always get what you want: Educational attainment, agency, and choice, J. Pers. Soc. Psychol., 88, 703–720 DOI:10.1037/0022-3514.88.4.703.
- Spencer H. E., (1996), Mathematical SAT test scores and college chemistry grades, J. Chem. Educ., 73(12), 1150–1153.
- Srougi M. C. and Miller H. B., (2018), Peer learning as a tool to strengthen math skills in introductory chemistry laboratories, Chem. Educ. Res. Pract., 19, 319–330.
- Stone K. L., Shaner S. E. and Fendrick C. M., (2018), Improving the success of first term general chemistry students at a liberal arts institution, Educ. Sci., 8(1), 5.
- Tobin K. G. and Capie W., (1981), The development and validation of a group test of logical thinking, Educ. Psychol. Meas., 41(2), 413–423.
- Waljee A. K., Higgins P. D. R. and Singal A. G., (2014), A primer on predictive models, Clin. Transl. Gastroenterol., 5, e44.
- Weisman R. L., (1981), A mathematics readiness test for prospective chemistry students, J. Chem. Educ., 58(7), 564.
- von Glasersfeld E., (1995), Radical Constructivism, London, England: Falmer Press.
- Williamson K. C., Williamson V. M. and Hinze S., (2017), Administering spatial and cognitive instruments in-class and on-line: Are these equivalent? J. Sci. Educ. Technol., 26, 12–23.
- Williamson V. M., (2008), in Bunce D. and Cole R. (ed.), Nuts and bolts of chemical education research, Washington, DC: American Chemical Society, pp. 67–78.
- Xu X. and Lewis J. E., (2011), Refinement of a chemistry attitude measure for college students, J. Chem. Educ., 88, 561–568.

## Footnote

† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9rp00077a

This journal is © The Royal Society of Chemistry 2019