Peer Review

How Reliable Are the VALUE Rubrics?

The VALUE reliability study was developed to gather data on the usability and transferability of rubrics both within and across institutions. This study was also designed to address the degree of reliability and consensus in scoring across faculty from different disciplinary backgrounds. Reliability data were gathered and analyzed for three of the fifteen existing VALUE rubrics—critical thinking, integrative learning, and civic engagement.

The effectiveness of assessment instruments is commonly evaluated by the degree to which validity and reliability can be established. Instruments should, therefore, both accurately capture the intended outcome (validity) and do so consistently (reliability). Because validity is often harder to establish than reliability, it is preferable for assessments to demonstrate multiple forms of validity. In important ways, the rubric development process itself provided the VALUE rubrics with substantial degrees of two types of validity. First, because the VALUE rubrics were created nationally by teams of faculty, the people closest to student learning and outcomes assessment on campuses, the rubrics hold a high degree of face validity. The face validity of the rubrics is apparent in the scale of interest in and circulation of the rubrics to date, as evidenced by the approximately eleven thousand people from over three thousand institutions and organizations, international and domestic, who have logged in to the AAC&U VALUE web page to access the rubrics.

Second, the specific employment of faculty experts in particular outcome areas to populate the development teams provides the rubrics with additional content validity. Experts are commonly used to establish content validity to verify that “the measure covers the full range of the concept’s meaning” (Chambliss and Schutt 2003, 69).

The objectives for establishing national reliability estimates for the VALUE rubrics were twofold. First, because the rubrics were created nationally and across disciplines, we sought to emulate this procedure in order to establish a cross-disciplinary reliability score for each rubric. Second, we sought to establish reliability scores within disciplines to examine the range of similarities and differences across faculty from different disciplinary backgrounds.

Methods

Forty-four faculty members were recruited to participate in the study and were evenly distributed across four broad disciplinary areas: humanities, natural sciences, social sciences, and professional and applied sciences. Each faculty member scored three samples of student work for each of the three rubrics. The initial round of scoring functioned as a “calibration round” through which scorers could familiarize themselves with the rubric and have their scores compared with calibration scores previously set by rubric experts. (The rubric experts employed to establish these baseline scores were national representatives already well acquainted with rubric scoring generally and with the VALUE rubrics specifically.) If a faculty member’s scores were closely aligned with the predetermined calibration scores, he or she was approved to score two more work samples. If scoring diverged on particular criteria, the faculty member reviewed the discrepancy with the project manager and participated in a final round of calibration scoring before moving on to the additional work samples. The scoring process was conducted virtually: scorers were given secure links to access the samples of student work, the rubrics, and the scoring sheet, and scores were entered online and uploaded to a secure spreadsheet for analysis.

The most common procedure for establishing the reliability of rubrics is the inter-coder, or inter-rater, method, by which two coders evaluate the same work sample, score it according to an approved rubric, and calculate a reliability score from their ratings. To accommodate scoring from multiple raters, a multi-rater kappa reliability statistic was used. The multi-rater kappa statistic ranges from -1 to 1, where -1 indicates perfect disagreement beyond chance and +1 indicates perfect agreement beyond chance; a score of zero indicates agreement at exactly the level expected by chance (Randolph 2008). Like other probability-based reliability estimates, the statistic indicates the degree to which raters’ scores are similar beyond what would occur by chance alone. The kappa reliability statistic, therefore, takes into account not just the amount of agreement on scores between raters, but also the probability that scorers will agree or disagree with each other by chance (see Viera and Garrett 2005). All of this information is combined into the single composite score, ranging from -1 to +1, described above.
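
As an illustration only, the following is a minimal Python sketch of a free-marginal multirater kappa of the kind produced by Randolph’s (2008) online calculator. The function name is ours, and the sketch assumes that every work sample (or criterion) is scored by the same number of raters; it is not a reproduction of the study’s exact tabulation.

    def multirater_kappa_free(ratings, num_categories):
        """Free-marginal multirater kappa (after Randolph 2008).

        ratings: a list of cases; each case lists the category (here 0-4)
        that each rater assigned to one work sample or criterion. Every
        case is assumed to be scored by the same number of raters.
        """
        n_raters = len(ratings[0])

        # Observed agreement: the average proportion of agreeing rater pairs per case.
        p_observed = 0.0
        for case in ratings:
            counts = {}
            for score in case:
                counts[score] = counts.get(score, 0) + 1
            agreeing_pairs = sum(c * (c - 1) for c in counts.values())
            p_observed += agreeing_pairs / (n_raters * (n_raters - 1))
        p_observed /= len(ratings)

        # Chance agreement when raters are free to use any of the categories.
        p_chance = 1.0 / num_categories

        return (p_observed - p_chance) / (1 - p_chance)

    # Example: three raters scoring four criteria on the 0-4 scale (five categories).
    print(multirater_kappa_free([[3, 3, 2], [1, 1, 1], [4, 3, 4], [0, 1, 0]], 5))

A value near zero would indicate agreement no better than chance, while values approaching 1 indicate strong agreement beyond chance.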

A more straightforward measure of scoring consistency was also calculated using the overall percentage of agreement among scores. This statistic simply divides the total number of agreements among scorers by the total number of their agreements plus the total number of their disagreements (see Araujo and Born 1985). The percentage of agreement will therefore often differ, sometimes markedly, from the kappa statistic, because the formula for the latter takes more information into account when determining agreement: the kappa statistic is based on actual scores plus additional information drawn from probability estimates, whereas the percentage of agreement calculation is based only on the actual scores and does not consider the probability of agreement by chance. Future work with these data will include additional reliability analyses using other common measures of inter-rater reliability, such as Krippendorff’s alpha, to provide further interpretation of these results.
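
For comparison, a pairwise version of the percentage of agreement calculation might look like the following sketch; the function name is ours and the study’s exact tabulation may differ, but the underlying ratio of agreements to agreements plus disagreements follows the formulation described by Araujo and Born (1985).

    from itertools import combinations

    def percent_agreement(ratings):
        """Overall percentage of agreement across cases: the number of
        agreeing rater pairs divided by all rater pairs (agreements plus
        disagreements)."""
        agreements = 0
        disagreements = 0
        for case in ratings:
            for first, second in combinations(case, 2):
                if first == second:
                    agreements += 1
                else:
                    disagreements += 1
        return 100.0 * agreements / (agreements + disagreements)

    # Using the same example as the kappa sketch above: 50 percent of rater pairs agreed exactly.
    print(percent_agreement([[3, 3, 2], [1, 1, 1], [4, 3, 4], [0, 1, 0]]))

Because this measure ignores chance agreement, it will typically be higher than the corresponding kappa value for the same data.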

Analysis

Results from the scoring of student work samples were analyzed in three phases. The first phase of analysis assumes that the objective is for raters to reach exact agreement on scores. Scorers could give each rubric criterion a rating on a five-point scale, 0–4, and the entire five-point scale was used in the calculations (see column 1 of figure 1, below). In practice, however, when working with faculty on campuses it is often not assumed that “perfect agreement” is necessary; it is assumed, rather, that close scores also count as agreement. To calculate “approximate agreement,” scores were reviewed prior to analysis and collapsed into four categories by calculating the average for each criterion and then combining close scores into one category. Additional measures of central tendency, the median and the mode, were also considered in determining which two response categories to collapse, so that the fullest majority of data points fell within the two categories identified for consolidation. For example, if the average was 2.0 but the mode and median were 2.0 and 1.5, respectively, ratings of 1 and 2 would be collapsed into a single category; if the average was 3.2 on a criterion, all scorers who gave a rating of 3 or 4 were collapsed into the same category. In a few cases in which all three measures of central tendency were the same, all data points from the sample were scanned to determine the second most frequently identified response category for consolidation. Results from this analysis are displayed in column 2 of figure 1.

Figure 1. Results for percentage of agreement and reliability

                                    COLUMN 1            COLUMN 2                  COLUMN 3
                                    Perfect             Approximate Agreement     Approximate Agreement
                                    Agreement           (using 4 categories)      (using 3 categories)

CRITICAL THINKING
  Percentage of Agreement           36% (9%)            64% (11%)                 89% (7%)
  Kappa Score                       0.29 (0.11)         0.52 (0.15)               0.83 (0.11)

INTEGRATIVE LEARNING
  Percentage of Agreement           28% (3%)            49% (8%)                  72% (8%)
  Kappa Score                       0.11 (0.04)         0.31 (0.11)               0.58 (0.11)

CIVIC ENGAGEMENT
  Percentage of Agreement           32% (5%)            58% (8%)                  78% (5%)
  Kappa Score                       0.15 (0.06)         0.44 (0.11)               0.66 (0.08)

COMBINED AVERAGE FOR ALL RUBRICS
  Percentage of Agreement           32% (6%)            57% (9%)                  80% (7%)
  Kappa Score                       0.18 (0.07)         0.42 (0.12)               0.69 (0.10)

A third iteration of this procedure allowed for greater approximation of the average score by collapsing scores on either side of the average into one category. Using the previous example, if the average was 3.2, scorers who gave a rating of 2, 3, or 4 on the criterion were collapsed into the same category. This final procedure most closely resembles the calibration process often used on campuses, where scores immediately adjacent to the most representative score are considered acceptable and only outlier scores are pursued for discussion. Results from this procedure are displayed in column 3 of figure 1. Standard deviations across the disciplinary reliability scores appear in parentheses next to each interdisciplinary result.
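
To make the collapsing step concrete, the following hypothetical Python sketch shows how raw 0–4 ratings for a single criterion could be recoded before re-running the agreement and kappa calculations. The function name and the explicit "band" argument are our simplification of the mean/median/mode procedure described above, not the study’s actual code.

    def approximate_categories(scores, band):
        """Recode one criterion's raw 0-4 ratings so that every score in the
        consensus band around the criterion average counts as one category;
        scores outside the band keep their original values."""
        return ["band" if score in band else score for score in scores]

    # One criterion scored by five raters; the average rating is 3.2.
    raw = [4, 3, 3, 2, 4]

    # Column 2 analysis: ratings of 3 and 4 count as one category (four categories remain).
    col2 = approximate_categories(raw, band={3, 4})

    # Column 3 analysis: ratings of 2, 3, and 4 count as one category (three categories remain).
    col3 = approximate_categories(raw, band={2, 3, 4})

    # The recoded scores can then be passed back through the percentage of
    # agreement and kappa calculations with the reduced number of categories.

Recoding in this way leaves genuine outliers distinguishable while treating scores close to the central tendency as agreement.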

As would be expected, the percentage of agreement increased as fewer rating categories were considered in the calculations, thus limiting the possibility for difference, and the kappa scores also increased substantially (a kappa of 0.70 is the conventional threshold for a “high” or “acceptable” score). Across rubrics, the results indicate that there was greater agreement than disagreement among scorers, as indicated by the positive kappa values. The most conservative estimates of agreement and reliability, using the full range of scoring options on the 0–4 scale, indicated that raters agreed exactly on nearly one-third of scores, though the overall kappa score was low. Allowing two categories, rather than one, to represent consensus provides a more compelling case for consistency across raters. Among the rubrics tested, the critical thinking rubric, which is also the learning outcome most often cited by faculty as a goal of their own teaching, had the highest degree of agreement and reliability; the integrative learning rubric, the newest learning outcome on the national scene, had the lowest scores on these measures. Relatively small standard deviations suggest that, overall, there was not great variation among scorers across disciplines. The critical thinking rubric showed the largest variation among scorers across the three analyses, while integrative learning and civic engagement showed comparable levels of divergence.

Are the Rubrics Reliable?

Consider the following scenarios:

Case 1. A group of faculty who have never met, who work on different and diverse campuses, and who were trained in different disciplines are asked to score common samples of student work. This group engages in a single practice round and, if issues arise, scorers speak one-on-one with a rubric expert.

Case 2. A group of faculty from the same campus, who may be familiar with each other and perhaps work in the same department, are asked to score samples of student work. This group engages in an interactive practice round with a rubric expert present to discuss interpretation and application of the rubric before scoring work samples.

One could argue that this study was conducted under circumstances most likely to compromise the chances of finding reliability in scoring. We gathered a diverse sample of national faculty who, as a group, had never participated in a common conversation about the structure of the rubrics or about their individual approaches to rubric scoring. These faculty members also did not engage in a standard calibration process through which discrepancies in scoring could be reviewed as a group to better understand differences in the interpretation and application of the rubric. Due to time and financial constraints, faculty discussed discrepancies with the project manager by phone, without the benefit of hearing other perspectives and input from colleagues.

Given these challenges, the results produced from this study provide an encouraging starting point for larger scale national and campus reliability analyses of the VALUE rubrics. Of particular note is the relative convergence in scoring across raters, regardless of academic discipline. This provides support for the use of rubrics interdisciplinarily at the institutional level—the primary intent in their development. Additionally, the relative strength in the percentage of agreement and moderately high kappa scores, particularly in column 2 above, provide grounds for pursuing future studies that can be more comprehensive, rigorous, and—perhaps necessarily—campus-based.

We hope campuses will look toward case 2, above, as an ideal scenario for developing knowledge on the reliability of the VALUE rubrics. This study was undertaken, despite its caveats, to advance the discussion about the utility of rubrics nationwide and across faculty from different disciplines. It provides a model but not necessarily a guide. The collaborative work of campuses will be essential for carrying this work forward.

References

Araujo, John, and David G. Born. 1985. “Calculating Percentage of Agreement Correctly but Writing Its Formula Incorrectly.” The Behavior Analyst 8 (2): 207–208.

Chambliss, Daniel F., and Russell K. Schutt. 2003. Making Sense of the Social World: Methods of Investigation. Thousand Oaks, CA: Pine Forge Press.

Randolph, J. J. 2008. Online Kappa Calculator. Retrieved September 22, 2011, from http://justus.randolph.name/kappa.

Viera, Anthony J., and Joanne M. Garrett. 2005. “Understanding Interobserver Agreement: The Kappa Statistic.” Family Medicine 37 (5): 360–363.
