Field M. Watts†* and Solaire A. Finkenstaedt-Quinn†*
Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1055, USA. E-mail: fieldmw@umich.edu; quinnsa@umich.edu
First published on 15th February 2021
The tradition of qualitative research drives much of chemistry education research activity. When performing qualitative studies, researchers must demonstrate the trustworthiness of their analysis so researchers and practitioners consuming their work can understand if and how the presented research claims and conclusions might be transferable to their unique educational settings. There are a number of steps researchers can take to demonstrate the trustworthiness of their work, one of which is demonstrating and reporting evidence of reliability. The purpose of this methodological review is to investigate the methods researchers use to establish and report reliability for chemistry education research articles including a qualitative research component. Drawing from the literature on qualitative research methodology and content analysis, we describe the approaches for establishing the reliability of qualitative data analysis using various measures of inter-rater reliability and processes including negotiated agreement. We used this background literature to guide our review of research articles containing a qualitative component and published in Chemistry Education Research and Practice and the Journal of Chemical Education from 2010 through 2019, assessing whether they report evidence of reliability. We followed this with a more in-depth analysis of how articles from 2017 through 2019 discuss reliability. Our analysis indicates that, overall, researchers are presenting evidence of reliability in chemistry education research (CER) articles by reporting reliability measures, describing a process of negotiated agreement, or mentioning reliability and the steps taken to demonstrate it. However, there is a reliance on reporting only percent agreement, which is not considered an acceptable measure of reliability when used on its own. In addition, the descriptions of how reliability was established were not always clear, which may make it difficult for readers to evaluate the veracity of research findings. Our findings indicate that, as a field, CER researchers should be more cognizant of the appropriateness of how we establish reliability for qualitative analysis and should more clearly present the processes by which reliability was established in CER manuscripts.
For establishing the trustworthiness of qualitative research, some researchers emphasize the need to provide quantitative measures of reliability while others focus on the need to adequately apply the trustworthiness criteria as described by Lincoln and Guba (1985). To illustrate the need for establishing trustworthiness for qualitative research, Armstrong et al. (1997) conducted a study where six experts in qualitative methodologies were tasked with analysing a transcript from a focus group and identifying up to five themes emerging from the data. The researchers then compared the themes that the experts identified. Armstrong et al. (1997) found that while the experts identified similar themes, they presented them differently. This indicates that there will be inherent differences in how researchers approach the same data set and that establishing some form of consistency during the data analysis process can support creating a cohesive interpretation. Ultimately, it is important for researchers to consider trustworthiness and reliability, but it is up to the researcher to determine the appropriate approach for their data and intended analysis. It is also important for researchers to clearly present their analysis, including the important component of establishing reliability, so that readers can better understand and evaluate the results of the research (Phelps, 1994; Towns, 2013; Seery et al., 2019). However, there is currently no review in the literature examining how reliability is established for qualitative CER. As such, the goal of this methodological review is to provide an overview of how researchers establish and describe reliability for qualitative research in CER articles. This overview is intended to inform future directions for how the field considers reporting reliability for qualitative research.
The primary focus of our methodological review—how chemistry education researchers demonstrate the reliability of their analysis in CER articles—aligns with one of the more conventional ways researchers may establish trustworthiness. Krippendorff (2004a) provides two definitions of reliability: that “a research procedure is reliable when it responds to the same phenomena in the same way regardless of the circumstances of its implementation” or “reliability is the degree to which members of a designated community agree on the readings, interpretations, responses to, or uses of given texts or data.” Krippendorff (2004a) describes three types of reliability: stability, reproducibility, and accuracy. Stability can be equated to intra-rater reliability, or how consistently an individual researcher analyses data when applying the same coding scheme over time. Reproducibility captures the consistency between two or more researchers applying the same codes to the same units of data. Accuracy involves comparing the coding of data to some standard that is deemed to correctly capture interpretations. While stability is limited by a single individual's conceptions and interpretations of the data, accuracy presupposes that there is a correct interpretation. Thus, reproducibility is the most commonly used type of reliability, which is often reported with quantitative inter-rater reliability (IRR) measures or through description of a consensus-making process between multiple researchers (Kenny, 1991; Krippendorff, 2004a; Garrison et al., 2006). This conceptualization of reliability may be especially useful in CER, as researchers must communicate their research that is rooted in subjectivist traditions with a disciplinary chemistry audience that is aligned with a more objectivist approach to the scientific process.
Qualitative research in the field of chemistry education provides a rich source of knowledge to both researchers and practitioners about how students learn chemistry, engage with various instructional tools and pedagogical interventions, and relate chemistry to their lives. Qualitative research designs most commonly include analysis of interviews (e.g., tasks and card sorts), observations (e.g., during laboratory activities or as students do group work), or artefacts (e.g., teaching documents or students’ work) (Bretz, 2008). This wide range of research designs is applied to study highly diverse phenomena (e.g., students’ understanding of molecular-level interactions, the development of chemistry knowledge from novice to expert, or instructors’ pedagogical content knowledge for specific chemistry subject matter). With the range of research designs and areas of inquiry, it is important for researchers to clearly describe how they are analysing their data. Additionally, because the qualitative CER tradition provides important findings about how instructors and students engage with the teaching and learning of chemistry, it is of utmost importance for researchers to consider the reliability of their research and to describe how they are doing so in published studies (Phelps, 1994; Towns, 2013; Seery et al., 2019).
The primary goals of this methodological review are to describe whether and how chemistry education researchers have demonstrated reliability in qualitative research articles, and to consider how this informs the ways we report reliability in the future as a field. An ancillary goal of this article is to provide a resource for future chemistry education researchers regarding the considerations for determining and reporting reliability. As such, to supplement the review of the content analysis literature guiding our methodological review, we provide a primer for reporting reliability in Appendix 1. This primer outlines the various considerations for demonstrating reliability when developing and applying a qualitative coding scheme (i.e., unitization of data, the reliability subsample, and reliability methodologies).
To achieve our goals for this review, we examine the methods by which researchers demonstrate reliability when presenting qualitative research in publications from two CER journals, Chemistry Education Research and Practice (CERP) and the Journal of Chemical Education (JCE), over the past ten years. In line with our goals, the review focuses primarily on reliability and does not provide a thorough analysis of other methods for demonstrating trustworthiness. Through an analysis of research publications from 2010–2019, with an in-depth focus on more recent publications from 2017–2019, we identify whether authors discuss establishing reliability and how they describe doing so (e.g., what IRR measures they use). This article is intended to inform CER researchers about the ways that reliability can be and has been demonstrated. Furthermore, this review has the goal of describing the reliability concerns chemistry educators should be aware of when reading qualitative research.
A number of articles have described best practices for establishing reliability during the coding process and for reporting reliability in qualitative studies (Lombard et al., 2002; Krippendorff, 2004a; Campbell et al., 2013; Hammer and Berland, 2014). Reliability measures are useful for both developing a coding scheme and applying a finalized scheme to a dataset. To facilitate developing a coding scheme that can be applied in a reliable manner, qualitative researchers offer a number of suggestions. Once researchers begin applying a coding scheme, it is recommended that they take an iterative approach of coding, discussing discrepancies, and refining or revising codes and their definitions to increase the consistency with which codes are applied (Campbell et al., 2013; Miles et al., 2014). This may also entail dropping or combining unreliable codes, but not to the point that the coding scheme loses its ability to capture the themes of interest for the research. During this development process, researchers can use reliability measures as an indication of whether and how their coding scheme should be revised (Campbell et al., 2013; Hammer and Berland, 2014). Furthermore, determining reliability measures during the coding scheme development process has the additional benefit of requiring researchers to sufficiently define codes so they can be applied similarly by another researcher. This lends itself to negotiations about whether the codes are accurately capturing the data as intended in a way that supports the analysis, serving to ultimately produce more reliable results (Hammer and Berland, 2014).
Beyond using reliability measures in the process of developing a coding scheme, researchers also calculate reliability measures to provide an indication of the reliability of a finalized coding scheme as it has been applied to the data (Krippendorff, 2004b). Researchers may decide to calculate a reliability measure for a percentage of the dataset—where coding 10–20% of the dataset is recommended—followed by a single researcher coding the remaining data, or they may decide to have multiple researchers code the entire dataset followed by resolving any discrepancies through consensus (Campbell et al., 2013; Neuendorf, 2017). To make this decision, it can be helpful for researchers to consider the reliability of the coding scheme. Specifically, if a coding scheme can be applied such that a high measure of IRR is obtained, there is merit for a single researcher to then apply the scheme to the full data set (Dunn, 1989; Campbell et al., 2013). However, if the researchers struggle to obtain an acceptable value of IRR, Campbell et al. (2013) argue that each researcher can instead code the entire data set, then compare their coding and resolve as many discrepancies as possible. The researchers should then provide a value indicating the inter-rater agreement. An alternative approach is coding by negotiated agreement, where researchers first independently code the data then meet to discuss the codes and decide the final application of the coding scheme for the entire dataset (Garrison et al., 2006). Determining the appropriate method to demonstrate reliability is important and depends to a certain extent on the data itself. For example, pursuing a high measure of reliability for complex data, such as semi-structured interviews, can lead to loss of validity as researchers attempt to simplify the coding scheme or avoid codes that require more interpretation (Krippendorff, 2004c; Campbell et al., 2013; Neuendorf, 2017). Thus, in such cases where the coding scheme is complex and more interpretation is required, utilizing inter-rater agreement or a consensus coding method may be merited (Garrison et al., 2006; Campbell et al., 2013).
After determining an approach to demonstrate reliability, researchers must consider the unitization of data and, if using a measurement to indicate reliability as opposed to negotiated agreement, select a measure appropriate for their coding process. Common measurements used for indicating reliability include percent agreement, correlation coefficients, and chance-corrected agreement coefficients (Neuendorf, 2017). Despite percent agreement being a popular measure, it has been critiqued for being inappropriate for describing reliability as it does not account for variation between coders or agreement due to chance (Lombard et al., 2002; Krippendorff, 2004a; Neuendorf, 2017). Correlation coefficients, such as Pearson's r and intraclass correlation coefficients (ICCs), account for the variation between coders. However, Pearson's r fails to account for actual agreement as it instead measures degree of linearity. As such, Pearson's r is suggested to be a less applicable correlation coefficient in comparison to ICCs (Neuendorf, 2017). Chance-corrected agreement coefficients, including the commonly used Cohen's kappa and Krippendorff's alpha, adjust simple percent agreement calculations by considering the probability that researchers agree due to chance (Cohen, 1960; Krippendorff, 2004a). Hence, they are suggested to be the most applicable and accepted reliability measures (Krippendorff, 2004a; Neuendorf, 2017). More information about each of the common IRR measures, along with the concerns of unitizing data and determining the reliability subsample, is provided in Appendix 1. Within the appendix, we provide resources for how to calculate the various measures and describe the type of data they apply to, their limitations, and variations that have been developed for each. The guidelines for how to interpret each of the appropriate reliability measures are summarized in Table 1.
Table 1 Guidelines for interpreting values of the recommended IRR measures

| Measure | Value | Interpretation | Extensions, variations, or alternatives |
|---|---|---|---|
| Intraclass correlation coefficients | 0.91–1.00 | Excellent | Lin's concordance correlation coefficient |
| | 0.76–0.90 | Good | |
| | 0.51–0.75 | Moderate | |
| | 0.00–0.50 | Poor | |
| Cohen's kappa | 0.91–1.00 | Almost perfect | Weighted kappa |
| | 0.80–0.90 | Strong | Fleiss’ kappa and Light's kappa |
| | 0.60–0.79 | Moderate | Fuzzy kappa |
| | 0.40–0.59 | Weak | Gwet's AC1 |
| | 0.21–0.39 | Minimal | |
| | 0.00–0.20 | None | |
| Krippendorff's alpha | 0.80–1.00 | Reliable value | N/A |
| | 0.67–0.79 | Acceptable for tentative conclusions | |
| | 0.00–0.66 | Not acceptable | |
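To make these measures concrete alongside the interpretation bands in Table 1, the following sketch computes percent agreement and Cohen's kappa from first principles for two coders applying one nominal code per unit of analysis. The code labels and coded data are invented for illustration.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Fraction of units to which two coders applied the same code."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders, one nominal code per unit (Cohen, 1960)."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    # Chance agreement estimated from each coder's marginal code frequencies
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical coding of ten units by two researchers
coder_1 = ["claim", "evidence", "claim", "reasoning", "claim",
           "evidence", "claim", "claim", "reasoning", "claim"]
coder_2 = ["claim", "evidence", "evidence", "reasoning", "claim",
           "evidence", "claim", "claim", "claim", "claim"]

print(f"percent agreement = {percent_agreement(coder_1, coder_2):.2f}")  # 0.80
print(f"Cohen's kappa     = {cohens_kappa(coder_1, coder_2):.2f}")       # ~0.62
```

Here the coders agree on eight of ten units, yet once chance agreement is removed the kappa value falls only in the “moderate” band of Table 1.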
To narrow the scope of this review, we developed further selection criteria (see Fig. 1) to encompass only articles presented as qualitative or mixed methods chemistry education research. As such, articles that did not include information about study design and a qualitative research methodology were excluded from further analysis (N = 86). These included practice papers that primarily served to describe a pedagogical strategy or intervention that included students’ perceptions or experiences but, upon further examination, were deemed to not include a research-based evaluation. Articles that were removed at this stage also included chemistry education research articles that, upon closer examination, did not include a qualitative component. The remaining articles comprised those which included analysis of qualitative data sources (N = 573). The qualitative data sources represented in our sample included studies analysing semi-structured and think-aloud interviews, open-ended survey or exam questions, and drawn or written student artefacts, in alignment with the common types of qualitative data outlined by Bretz (2008). All of these articles were included in our analysis of whether reliability is being reported in CER articles. The subset of articles from 2017–2019 (N = 236) was subjected to more detailed evaluation guided by our analytical framework to address our second goal of understanding how reliability is reported in CER articles (Fig. 1). We conducted the more detailed analysis in reverse chronological order (i.e., starting with articles from 2019) and concluded the analysis when saturation was reached and we were no longer identifying additional trends in the data (Miles et al., 2014). As such, this second level of analysis focused on CER articles published in the last three years.
For each article designated as not including qualitative reliability measures or descriptions of negotiated agreement (N = 262), we examined the portions of the methods section where developing and applying codes to the data was described. Within this section, we sought to identify mention of the terms “reliability” or “trustworthiness,” or descriptions of the researchers engaging in some process that could be viewed as establishing the reliability of their coding. This means that we did not necessarily identify articles using strategies such as triangulation or an inquiry audit to appeal to the trustworthiness criteria described by Lincoln and Guba (1985) unless these procedures were described as being done to establish trustworthiness or reliability of the coding process. We justify this decision because the focus of our analysis is trustworthiness as demonstrated through reliability. For the articles from 2017–2019 subjected to a more detailed analysis (N = 109), we extracted additional information about how reliability was mentioned and identified themes for how reliability was discussed (Braun and Clarke, 2006).
We also examined trends over time in whether and how reliability is presented. Of note, there appears to be a decrease in the fraction of articles that do not contain any discussion of reliability from 2010 to 2019 (Fig. 2). A chi-squared test comparing the number of articles assigned to each approach between 2017–2019 and 2010–2016 indicates that there is a statistically significant difference in the frequencies between the two time ranges across the five categories (chi-squared = 37.3656, p < 0.001). As exemplified by Table 2, this shift may align with an increase in the fraction of articles that we identified as containing some mention of reliability other than providing a measure or describing complete consensus and a corresponding decrease in the fraction of articles that did not contain any mention of reliability.
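For readers less familiar with this kind of frequency comparison, a minimal sketch using scipy is shown below. The category counts are hypothetical placeholders used to illustrate the mechanics of the test, not the counts underlying the statistic reported above.

```python
from scipy.stats import chi2_contingency

# Rows: time ranges (2010–2016, 2017–2019); columns: the five reliability
# categories (measure, negotiated agreement, both, mention, no mention).
# All counts here are invented for illustration.
observed = [
    [60, 20, 10, 40, 120],  # 2010–2016
    [45, 15, 12, 60,  55],  # 2017–2019
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```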
Table 3 Descriptive statistics for the reliability values reported per data source (2017–2019)

| Measure | Other measure | N | Mean | St. dev. | Min. | Med. | Max. |
|---|---|---|---|---|---|---|---|
| Consensus only | — | 37 | — | — | — | — | — |
| Percent agreement only | — | 44 | 0.91 | 0.07 | 0.70 | 0.92 | 1.00 |
| Percent agreement with another measure | Cohen's kappa | 10 | 0.92 | 0.04 | 0.85 | 0.91 | 1.00 |
| | Krippendorff's alpha | 1 | 0.95 | — | — | — | — |
| Percent agreement with another measure followed by consensus | Cohen's kappa | 2 | 0.97 | 0.04 | 0.94 | 0.97 | 1.00 |
| | Krippendorff's alpha | 1 | 0.92 | — | — | — | — |
| Percent agreement followed by consensus | — | 17 | 0.86 | 0.08 | 0.75 | 0.87 | 1.00 |
| Percent agreement (all) | — | 75 | 0.90 | 0.07 | 0.70 | 0.91 | 1.00 |
| Pearson's r | — | 1 | 0.99 | — | — | — | — |
| Intraclass correlation (ICC) | — | 4 | 0.87 | 0.08 | 0.75 | 0.90 | 0.92 |
| Cohen's kappa only | — | 20 | 0.87 | 0.11 | 0.50 | 0.90 | 0.99 |
| Cohen's kappa with another measure | Percent agreement | 10 | 0.82 | 0.06 | 0.74 | 0.80 | 0.90 |
| Cohen's kappa with another measure followed by consensus | Percent agreement | 2 | 0.69 | 0.37 | 0.43 | 0.69 | 0.95 |
| Cohen's kappa followed by consensus | — | 1 | 0.77 | — | — | — | — |
| Cohen's kappa (all) | — | 33 | 0.84 | 0.12 | 0.43 | 0.88 | 0.99 |
| Krippendorff's alpha only | — | 5 | 0.86 | 0.06 | 0.78 | 0.88 | 0.92 |
| Krippendorff's alpha with another measure | Percent agreement | 1 | 0.80 | — | — | — | — |
| Krippendorff's alpha with another measure followed by consensus | Percent agreement | 1 | 0.84 | — | — | — | — |
| Krippendorff's alpha (all) | — | 7 | 0.85 | 0.05 | 0.78 | 0.84 | 0.92 |
| Measure not specified | — | 3 | 0.94 | 0.03 | 0.91 | 0.94 | 0.97 |
| Measure not specified followed by consensus | — | 1 | 0.80 | — | — | — | — |
| Measure not specified (all) | — | 4 | 0.91 | 0.07 | 0.80 | 0.93 | 0.97 |
| No measure or consensus (in article that provides measure or consensus for another data source) | — | 21 | — | — | — | — | — |
| Total data sources | — | 182 | — | — | — | — | — |
Percent agreement was the most commonly reported reliability measure across the articles from 2017–2019 and the overall average percent agreement reported was 90% (Table 3). For 13 data sources, researchers described following the reliability process suggested by Campbell et al. (2013) but did not reach complete consensus at the inter-rater agreement stage. Researchers did report a percent agreement value followed by reaching complete consensus for 17 data sources, aligning with the process of negotiated agreement reported by Garrison et al. (2006) or the process of inter-rater agreement described by Campbell et al. (2013). Researchers reported percent agreement alongside a chance-corrected reliability measure (either Cohen's kappa or Krippendorff's alpha) in 14 data sources, and in three of these instances they also reported reaching consensus. The different percent agreement values reported for each combination with different IRR measures or negotiated agreement are presented in Table 3. The prevalence of researchers reporting percent agreement alone is notable, as percent agreement is often cited as being inappropriate for demonstrating the reliability of a coding scheme (Krippendorff, 2004a; Neuendorf, 2017). However, the number of data sources for which percent agreement is reported alongside another measure or in conjunction with researchers performing consensus may indicate that researchers are recognizing that percent agreement is not viewed as an acceptable stand-alone measure for indicating reliability (Krippendorff, 2004a; Neuendorf, 2017). Furthermore, this trend aligns with shifts in other fields (Hughes and Garrett, 1990; Lombard et al., 2002). The move away from percent agreement originates from its inability to account for variation in researchers’ application of codes or the possibility of agreement by chance (Krippendorff, 2004a; Neuendorf, 2017). Hence, percent agreement is not recommended for reporting IRR unless accompanied by another, more robust IRR measure (Neuendorf, 2017).
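A short worked example, with assumed marginals rather than values drawn from the reviewed articles, shows how chance agreement inflates percent agreement when one code dominates:

```python
# Suppose two coders each apply the code "present" to 90% of 100 units.
p_o = 0.90                       # observed percent agreement
p_e = 0.90 * 0.90 + 0.10 * 0.10  # agreement expected by chance = 0.82
kappa = (p_o - p_e) / (1 - p_e)  # (0.90 - 0.82) / 0.18
print(f"kappa = {kappa:.2f}")    # ~0.44, only "weak" per Table 1 despite 90% agreement
```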
The second-most commonly reported measure of IRR was Cohen's kappa or a variation of the kappa statistic (Table 3). This aligns with reports in other disciplines, which also identify Cohen's kappa as the most commonly reported statistic for measuring reliability (Riffe and Freitag, 1997; Lombard et al., 2002; Manganello and Blake, 2010). The overall average reported kappa value of 0.84 (Table 3) falls within the range of “strong” agreement for interpreting kappa (Table 1). For 13 data sources, researchers also reported other approaches to demonstrate the reliability of their analysis (Table 3). It is a positive result that the second-most reported reliability measure is one that accounts for agreement by chance, as measures within this class are thought to best reflect reliability for content analysis. However, as the most commonly reported chance-corrected reliability measure within CER, it is worth noting that Cohen's kappa has many limitations—including that it only allows for two coders, one code per unit of analysis, and that it can produce values which do not accurately reflect agreement when there are skewed distributions of applied codes or when coders have similar distributions of applied codes (Gwet, 2002; Krippendorff, 2004a; Warrens, 2010; Neuendorf, 2017). There may be some movement to address these limitations within CER, as exemplified by a small number of researchers within the discipline utilizing extensions that overcome these limitations. Specifically, extensions of Cohen's kappa were reported for three data sources, with one each reporting Light's kappa, fuzzy kappa, and Gwet's AC1 metric. Light's kappa allows for multiple coders, while fuzzy kappa allows for multiple codes to be applied to a single unit of analysis (Light, 1971; Kirilenko and Stepchenkova, 2016). Gwet's AC1 metric accounts for the problems inherent in the calculation of kappa and provides values that more accurately reflect agreement (Gwet, 2002). That researchers are utilizing these measures may indicate a positive shift in the field towards using measures which allow for different types of coding procedures that align with the research goals and that better reflect the agreement between researchers.
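To illustrate why such extensions matter, the sketch below implements Gwet's AC1 for two coders and nominal codes, following the chance-agreement formula in Gwet (2002), and applies it to invented data with the same skewed marginals as the percent agreement example above (where Cohen's kappa is only about 0.44):

```python
from collections import Counter

def gwets_ac1(codes_a, codes_b):
    """Gwet's AC1 for two coders and nominal codes (sketch)."""
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    pooled = Counter(codes_a) + Counter(codes_b)        # both coders' codes pooled
    pi = [pooled[c] / (2 * n) for c in pooled]          # mean proportion per category
    p_e = sum(p * (1 - p) for p in pi) / (len(pi) - 1)  # AC1 chance agreement
    return (p_o - p_e) / (1 - p_e)

# Skewed hypothetical coding: both coders apply "present" to 90 of 100 units
# and agree on 90 units overall.
coder_1 = ["present"] * 90 + ["absent"] * 10
coder_2 = ["present"] * 85 + ["absent"] * 5 + ["present"] * 5 + ["absent"] * 5
print(f"AC1 = {gwets_ac1(coder_1, coder_2):.2f}")  # ~0.88, versus kappa of ~0.44
```

Under these skewed marginals, AC1 better reflects the 90% raw agreement, which is the behaviour the statistic was designed to have (Gwet, 2002).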
Few researchers reported Krippendorff's alpha, with the measure reported for only seven data sources (Table 3). Researchers also reported other approaches to demonstrate reliability for two of the data sources (Table 3). The values reported for Krippendorff's alpha generally fell above Krippendorff's suggested 0.80 cut-off for taking the results as reliable (Table 1), with an overall average of 0.85 (Table 3). That few researchers reported using Krippendorff's alpha aligns with similar findings in other disciplines, where Cohen's kappa is the more frequently utilized chance-corrected agreement measure (Riffe and Freitag, 1997; Lombard et al., 2002; Manganello and Blake, 2010). However, it may be beneficial for more researchers to begin using Krippendorff's alpha, as it does not have many of the limitations of Cohen's kappa and is thus more broadly applicable to a range of coding procedures to address a range of research questions. This is an important benefit, as CER draws on diverse qualitative data sources that can range from open-ended exam questions to interview data, each of which may be analysed in different ways (e.g., ordinal coding for exam responses, or nominal coding for interview responses). Specifically, Krippendorff's alpha is useful for nominal, ordinal, or interval data, and is suitable for situations with small samples of coded data, multiple raters, or incomplete data (Krippendorff, 2004a). Furthermore, in contrast to Cohen's kappa, Krippendorff's alpha is not limited in situations when researchers have similar or skewed distributions of codes (Gwet, 2002; Krippendorff, 2004a; Warrens, 2010; Neuendorf, 2017).
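For researchers considering Krippendorff's alpha, the sketch below uses the third-party krippendorff Python package (one of several available implementations) to show how the measure accommodates multiple coders and missing ratings. The coded values are invented and recorded as numeric labels for a nominal-level analysis.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Rows are coders, columns are units; np.nan marks units a coder did not code.
ratings = np.array([
    [1,      2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1,      2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```

Because alpha is defined for nominal, ordinal, and interval data, the same call can be adapted by changing level_of_measurement to match the coding scheme.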
Researchers reported correlation coefficients relatively infrequently, with only five data sources described as using a correlation coefficient to quantify agreement between researchers. Of these data sources, one reported a Pearson's r coefficient, while the other four reported an ICC value. The average reported ICC value was 0.87, which falls within the interpretation of achieving “good” reliability (Table 1). The minimal number of researchers reporting Pearson's r is promising, as the literature indicates it is not necessarily an appropriate correlation coefficient for demonstrating reliability (Krippendorff, 2004a; Watson and Petrie, 2010). The inappropriateness of Pearson's r is specifically because it responds to differences in linearity as opposed to differences in agreement between two researchers (Krippendorff, 2004a; Watson and Petrie, 2010; Neuendorf, 2017). That the majority of researchers reporting a correlation coefficient are reporting an ICC value is notable, as this value is a more acceptable measure of IRR (Watson and Petrie, 2010; Neuendorf, 2017). ICCs and similarly calculated correlation coefficients are thought to be more appropriate because they account for covariation between researchers’ applications of codes in addition to identifying deviance from perfect agreement (Neuendorf, 2017).
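As one possible route to an ICC in practice, the sketch below uses the third-party pingouin package with invented ratings in long format; selecting ICC2 (two-way random effects, absolute agreement, single rater) is a common convention rather than a universal rule.

```python
import pandas as pd
import pingouin as pg  # third-party package: pip install pingouin

# Long format: one row per (unit, coder) pair; the scores are hypothetical
df = pd.DataFrame({
    "unit":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "coder": ["A", "B"] * 6,
    "score": [4, 4, 2, 3, 5, 5, 3, 3, 1, 2, 4, 4],
})
icc = pg.intraclass_corr(data=df, targets="unit", raters="coder", ratings="score")
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "CI95%"]])
```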
The average reported values for ICCs and both of the chance-corrected agreement coefficients were relatively high, reflecting the standards of reliability across CER articles that include these measures. It is important to note the relatively high variation among some of these reported values, as indicated by the standard deviations presented in Table 3: data analyses with values below the highest ranges on the interpretation scales (e.g., values below the ranges of “almost perfect” and “strong” for Cohen's kappa) are being published. The reported values below the highest interpretation ranges are nevertheless acceptable for making tentative conclusions (Krippendorff, 2004a). Additionally, the lower values could be an artefact of the acknowledged limitations and paradoxes associated specifically with Cohen's kappa (Krippendorff, 2004a; Krippendorff, 2004b; Warrens, 2010; Neuendorf, 2017). Thus, if researchers do obtain lower reliability values, they should consider possible justifications or implications for the lower values, the strength of the claims they can make, whether another measure is more appropriate for their coding procedure, or if it would be appropriate or feasible to perform negotiated agreement on the full data set. Some, but not all, of the researchers reporting lower IRR values within the analysed articles did report taking these additional measures.
While the reported IRR measures are generally within acceptable ranges, it is important to emphasize that the data sources for which ICCs or chance-corrected agreement measures were reported only made up 24% (N = 44) of the data sources within articles for which a measure was reported—while simple percent agreement alone or describing a consensus-making process without providing a reliability measure was more common (45%, N = 81). These findings indicate that while, on average, good reliability is being demonstrated when an agreement measure is reported, there is a need for researchers to report the appropriate IRR measures in CER articles.
Another key finding is that 30% (N = 55) of data sources did not report an ICC or chance-corrected agreement coefficient but did describe coding using consensus or negotiated agreement (Table 3). This methodology is useful for reducing errors in the analysis or minimizing the subjectivity imposed by a single researcher and can be useful for complex data that may be difficult to code in a reliable fashion (Garrison et al., 2006). However, it has been criticized for not directly appealing to the notion of reliability (Krippendorff, 2004a). Thus, if researchers decide that the consensus method is appropriate for their situation, they should keep in mind the different viewpoints in the content analysis literature regarding whether it adequately establishes reliability.
We also identified 21 qualitative data sources for which neither a measure nor consensus were reported despite reliability being demonstrated for other data sources in the articles. While these were primarily secondary data sources, it is still recommended that researchers demonstrate reliability for all components of their analysis. Lastly, there were four instances where researchers reported a value but did not specify the measure, which makes it difficult for the reader to assess the level of reliability of the analysis. Hence, it is important for researchers to provide sufficient detail pertaining to the steps taken to establish reliability so readers can evaluate the appropriateness of the reported procedure for the presented analysis (Towns, 2013; Seery et al., 2019). When choosing to report a measure of IRR, this includes specifying how the data were unitized, how much of the data was coded to demonstrate reliability, and whether a process of negotiated agreement was used in tandem with calculating reliability measures, as well as reporting percent agreement only in conjunction with another measure or complete consensus.
About a third of the articles in this category mentioned trustworthiness either exclusively or as part of their discussion of reliability (N = 30). These contained descriptions of the researchers engaging in elements of trustworthiness such as discussion within the research team as analysis was being performed, triangulation of the analysis across data sources, member checking, and discussion with an external researcher about the coding scheme or themes derived from the analysis. Triangulation and discussion within the research team were most commonly described and are both modes that Lincoln and Guba (1985) describe for establishing dependability, the construct which they align with reliability. However, the strongest method for establishing dependability, as argued by Lincoln and Guba (1985), is engaging an external researcher to perform an “inquiry audit” of the research process. While the descriptions of discussion with an external researcher do align with the idea of an inquiry audit, as described they are not as all-encompassing or thorough as the audit process recommended by Lincoln and Guba (1985). Thus, if researchers determine the naturalistic approach for demonstrating trustworthiness to be appropriate for their research, it is important for them to carefully consider the necessary steps for doing so.
Our analysis does indicate an encouraging increase in the number of researchers providing measures of reliability or describing a process of negotiated agreement for the analysis of qualitative data. Furthermore, our detailed analysis of articles from 2017 through 2019 indicates that researchers are establishing reliability using a variety of IRR measures, including Cohen's kappa, Krippendorff's alpha, and ICCs. Despite the availability and use of a variety of IRR measures, however, many chemistry education researchers are reporting only simple percent agreement, a measure which is criticized by many researchers for not providing an accurate indication of reliability (Lombard et al., 2002; Krippendorff, 2004a; Neuendorf, 2017). While there is debate within the field of content analysis about the most appropriate measure for IRR, ICCs and the chance-corrected agreement coefficients—Cohen's kappa and Krippendorff's alpha—are generally accepted to be the most appropriate (Neuendorf, 2017). Researchers should thus carefully consider their coding process and the complexity of their data to determine whether using a measure or negotiated agreement process is more appropriate. If researchers choose to use a reliability measure, they should also consider the appropriate applications of each measure to determine which is most appropriate for their use and, if using Cohen's kappa, whether one of the variations is needed.
Of the articles that did not mention a reliability measure or engage in negotiated agreement for the full data set, only half mentioned considering the reliability or trustworthiness of their analysis. The majority of the articles that did contain some mention of these described a process involving multiple researchers, triangulation, or discussion of the coding with an external researcher. While engaging in these processes is beneficial for coding scheme development, researchers must determine whether engaging in and describing these processes is sufficient for establishing the reliability of their analysis. However researchers choose to establish reliability or trustworthiness—whether it be through calculating an IRR measure or describing the steps taken to determine reliability—it is key to consider if and how the chosen approach can influence the limitations of their research. This is important so practitioners and researchers can fully understand the approaches taken during data analysis to arrive at the results of a particular study.
It is also worth noting that, in our sample, the discussion of reliability was often difficult to identify, as exemplified by the IRR measures reported for our own analysis (79% agreement with Cohen's kappa of 0.72—indicating moderate agreement—and Krippendorff's alpha of 0.72—indicating agreement acceptable for tentative conclusions). Our approach of additionally engaging in negotiated agreement to reach complete consensus when analysing data for this review—an approach also present within articles in our data set—was hence useful for classifying ambiguous discussions of reliability. One implication of our moderate reliability values is that even a seemingly straightforward coding scheme can be difficult to apply and might warrant the use of consensus coding or negotiated agreement, especially when considering the complexity of data often analysed in CER. Additionally, our moderate agreement values indicate that the discussion of reliability within the articles included in our data set was not always clear or easy to identify. Relatedly, in some articles that reported utilizing negotiated agreement during the coding process, it was ambiguous whether complete consensus was reached. As such, we suggest that details pertaining to reliability should be clearly discussed in the methods section of CER articles with clear indication of the specific IRR measures calculated, if any—which was not the case in all articles within our data set. As suggested in editorials for both CERP and JCE (Towns, 2013; Seery et al., 2019), incorporating clear demonstration of the steps taken to establish the reliability of qualitative analyses in CER will ultimately serve to strengthen the rigor of the field so both researchers and practitioners can make better sense of the ways they can incorporate key findings and results into their own future research or instructional practice.
| Code | Definition |
|---|---|
| Measure | The article contains a description of the specific reliability measure used to determine IRR and provides the corresponding values. |
| Negotiated agreement | The article contains a description of researchers engaging in the process of negotiated agreement for the full set of data being analysed. |
| Measure and negotiated agreement | The article contains a description of both a reliability measure and researchers engaging in negotiated agreement. This can be for the same data source or different sources of data in the same article. |
| Mention | The article contains a description of using some method to ascertain the reliability or trustworthiness of the analysis other than using a reliability measure or engaging in negotiated agreement for the full data set. |
| No mention | The article contains no description of reliability or trustworthiness. |
Footnote
† Both authors contributed equally to this work.