Peer Review

Increasing the Validity of Outcomes Assessment

Increasing accountability in higher education has prompted institutions to develop methods to meet growing expectations that they will implement, and demonstrate a commitment to, the assessment of student learning outcomes. These responses range from ensuring compliance with the mandatory requirements of accrediting bodies to adopting institution-specific assessment regimens designed to align more closely with local conditions and cultures. One such response is the Voluntary System of Accountability (VSA), and this article explores how the University of Delaware (UD) implemented the Association of American Colleges and Universities’ VALUE rubrics as part of its ongoing initiative to create a campus-wide culture of assessment. As UD has implemented the VALUE approach to assessment, both the value of local adaptation and the limitations of adopting a national testing regimen have become increasingly apparent.

In 2007, the VSA was initiated by public four-year universities and two higher education associations, the Association of Public and Land-grant Universities and the American Association of State Colleges and Universities, to supply comparable information on the undergraduate student experience through standardized testing of the core general education (Gen Ed) skills of critical thinking, reading, writing, and mathematics (quantitative reasoning). The argument for standardized tests is that they provide the best way to assess student learning through universal, unbiased measures of student and school performance. Critics claim, however, that these tests fail to accurately assess students’ knowledge and school performance, may be biased against underserved populations, and are not reliable measures of institutional quality (Beaupré, Nathan, and Kaplan 2002). The VSA advocates the use of one of three standardized tests; UD chose the Educational Proficiency Profile (EPP). After an initial experience with this test, UD decided to work instead with the VALUE project and its defined core learning goals and modifiable assessment rubrics. Choosing this path, institutional leaders believed, would allow UD to examine student learning with greater sensitivity to local conditions and to obtain more useful information on the quality and type(s) of student learning outcomes that are most challenging to evaluate via standardized tests.

Validity of EPP and VALUE Results

UD’s Office of Educational Assessment (OEA) administered the abbreviated EPP to 196 first-year students and 121 seniors in fall 2010. The Educational Testing Service (ETS) subsequently provided results in the form of scaled scores on the four core skill areas (reading, critical thinking, writing, and mathematics) as well as an aggregated individual score for each test taker. ETS stresses that institutions should not focus on individual results on the abbreviated EPP but should instead concentrate on aggregate scores. Because UD administered the abbreviated version of the EPP and ETS provides comparative mean scores only for the long version of its Proficiency Profile Total Test, there was no precise basis for comparison, notwithstanding ETS’s claim that the abbreviated test provides equivalent results. The scores obtained from ETS therefore offered little guidance in understanding how UD students performed relative to test populations at other universities. Additionally, the smaller number of questions on the abbreviated EPP raised concerns about its content validity compared with results from the longer version; there were, for example, only nine questions designed to examine student proficiency in quantitative reasoning.

In terms of the local testing environment, the methods required to obtain first-year students’ and seniors’ participation in this Gen Ed study may well have affected the validity of the EPP results. Freshmen who took the EPP came largely from First Year Experience (FYE) courses, whose faculty were asked to give up class time so that students could take the test. Seniors were recruited mostly from one class period in each of three senior capstone courses. As an additional strategy to recruit seniors, OEA staff posted open advertisements in large classroom buildings around campus asking senior students to participate in the assessment, eventually offering a chance to win an iPad2 as an incentive to take the EPP.

These concerns over the validity of the EPP results motivated UD to explore a different approach to assessing longitudinal learning outcomes. In summer 2011, the OEA hired three UD faculty members to evaluate the Gen Ed competencies of critical thinking and quantitative reasoning. These Faculty Assessment Scholars (FASs) examined artifacts provided by the same first-year students and seniors who had taken the fall 2010 EPP. Because the work samples came from the same self-selected students, the data carried the same bias challenges; however, because the artifacts were selected by the students as their best work, this evaluation of authentic work samples did provide greater face validity. The FASs attended an hour-long training session to calibrate their use of the two VALUE rubrics adapted by the OEA, one for assessing critical thinking skills (the VALUE inquiry and analysis rubric) and one for assessing quantitative reasoning skills. While this was a limited calibration exercise, it proved to be a successful model and one that could readily be enhanced in future assessment projects.

Inter-Rater Reliability Testing

Each criterion on the three separate rubrics (first-year critical thinking, senior critical thinking, and senior quantitative reasoning) assessed by the FASs was tested for inter-rater reliability. Inter-rater reliability among the three FASs was examined using the intraclass correlation coefficient, a measure of agreement among multiple raters on the same set of criteria on the VALUE rubrics. For each criterion, the ratings given by all three FASs across all student work samples were entered into a two-way mixed model to obtain an intraclass correlation coefficient. As a measure of inter-rater reliability, the intraclass correlation coefficient ranges from -1.0, indicating extreme heterogeneity, to 1.0, indicating complete homogeneity (Kish 1965, 171); coefficients closer to 1.0 indicate less variation among the three FASs in evaluating student artifacts. Since each of the three rubrics contains six criteria, eighteen two-way models were produced.
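As an illustration of the general technique, and not a reproduction of the OEA’s actual analysis, the sketch below computes an intraclass correlation coefficient for three raters scoring a shared set of artifacts on a single rubric criterion. It uses Python with the pandas and pingouin libraries, and the artifact IDs and ratings are invented.

```python
# Minimal sketch of an inter-rater reliability check for one rubric criterion.
# Assumes pandas and pingouin are installed; all data below are invented.
import pandas as pd
import pingouin as pg

# Long format: one row per (artifact, rater) pair for a single criterion.
scores = pd.DataFrame({
    "artifact": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":    ["FAS1", "FAS2", "FAS3"] * 4,
    "rating":   [4, 4, 3, 3, 3, 3, 2, 3, 2, 4, 4, 4],
})

icc = pg.intraclass_corr(data=scores, targets="artifact",
                         raters="rater", ratings="rating")

# ICC3 is the Shrout-Fleiss ICC(3,1): a two-way mixed-effects, single-rater,
# consistency coefficient, analogous to the two-way mixed model described above.
print(icc.set_index("Type").loc["ICC3", ["ICC", "CI95%"]])
```

In practice, one such model would be fitted for each of the eighteen criterion-rubric combinations described above.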

Not all of the students who took the EPP chose to provide work samples; the OEA therefore screened the submitted artifacts and released to the FASs only those that contained enough information to be evaluated with the VALUE rubrics. In total, the FASs evaluated ten first-year critical thinking artifacts, ten senior critical thinking artifacts, and ten senior quantitative reasoning artifacts. The FASs used the quantitative reasoning rubric adapted by the OEA to assess the ten artifacts from seniors; none of the first-year students provided a quantitative reasoning artifact. Table 1 shows, for each of the six criteria in this rubric, the percentage of seniors’ ratings at each level on the 1-to-4 scale, along with the mean rating for each criterion.
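Each reported mean is the rating-weighted average of that criterion’s percentage distribution; for the Interpretation criterion, for example, (0.52 × 4) + (0.43 × 3) + (0.05 × 2) + (0.00 × 1) ≈ 3.47, which corresponds to the reported mean of 3.48 given rounding of the percentages.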

Table 1. Quantitative reasoning assessment of seniors

Quantitative Reasoning Criterion    4       3       2       1      Mean
Interpretation                      52%     43%     5%      0%     3.48
Representation                      57%     29%     14%     0%     3.43
Calculations                        75%     12.5%   12.5%   0%     3.63
Application/Analysis                58%     29%     13%     0%     3.46
Assumptions                         50%     22%     22%     6%     3.17
Communication                       52%     32%     16%     0%     3.36

Percentages indicate the share of seniors’ ratings at each level on the 1-to-4 scale (4 = highest).

The UD FASs used the VALUE inquiry and analysis rubric, adapted by the OEA, to assess the Gen Ed competency of critical thinking. The FASs reviewed twenty artifacts from first-year students and seniors (ten in each group). Table 2 shows, for each of the six criteria in the inquiry and analysis rubric, the percentage of first-year students’ and seniors’ ratings at each level on the 1-to-4 scale.

Despite the minimal training the FASs received before rating students’ work, these findings indicate a high level of agreement among the raters.

Table 2. Critical thinking, inquiry, and analysis assessment of first-year students and seniors

Critical Thinking Criterion      Group         4      3      2      1      Mean    Gain in Mean
Topic Selection                  First-year    30%    44%    19%    7%     2.96
                                 Seniors       69%    24%    7%     0%     3.55    0.59
Existing Knowledge               First-year    28%    24%    43%    5%     2.76
                                 Seniors       57%    29%    14%    0%     3.43    0.67
Design Process                   First-year    19%    48%    28%    5%     2.80
                                 Seniors       61%    31%    8%     0%     3.58    0.78
Analysis                         First-year    26%    22%    37%    15%    2.60
                                 Seniors       38%    45%    17%    0%     3.10    0.50
Conclusions                      First-year    29%    42%    21%    8%     2.92
                                 Seniors       47%    47%    6%     0%     3.40    0.48
Limitations and Implications     First-year    30%    40%    25%    5%     2.55
                                 Seniors       41%    48%    11%    0%     3.30    0.75

Percentages indicate the share of ratings at each level on the 1-to-4 scale (4 = highest).

Average gain across all criteria: 0.63
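The gain in mean for each criterion is the difference between the seniors’ and first-year students’ mean ratings, and the average gain is the unweighted mean of those six differences: (0.59 + 0.67 + 0.78 + 0.50 + 0.48 + 0.75) / 6 ≈ 0.63.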

 

Costs

Although the expense of using the Educational Testing Service’s EPP to examine the Gen Ed competencies of UD undergraduates was not the determining factor in the decision to adopt the VALUE rubrics as an alternative method of assessment, the costs were certainly substantial in relation to the benefits derived. By contrast, the direct and indirect expenses required to implement the locally modified rubrics were far more modest. ETS charged $7,000 for the scantron sheets, test booklets, and the processing and reporting of test scores. Table 3 lists the costs incurred for the EPP alongside the costs necessary to use the VALUE rubrics to assess the Gen Ed competencies of undergraduates independent of the EPP.

 

Table 3. Costs associated with implementing each assessment approach

ITEM                                                                            EPP           VALUE RUBRIC
Tests and processing fees charged by ETS                                        $7,000
Pencils for test completion                                                     $47
OEA staff time to analyze data sent from ETS vs. rubric data                    60 hours      25 hours
Registrar and OEA secretary time to secure classrooms for testing sessions      10 hours
Candy for recruiting volunteers                                                 $41
iPad2 incentive                                                                 $499
OEA staff time to recruit FYE and senior capstone faculty for testing sessions  30 hours
OEA staff time to attend all FYE courses, capstone classes, and open sessions   41.5 hours
Open sessions for senior volunteers (six senior “open periods”)                 25 hours
Class periods taken from FYE courses (eleven periods)                           UC
Class periods taken from senior capstones (eight periods)                       UC
Costs to hire three Faculty Assessment Scholars ($705 x 3)                                    $2,250

Hours = approximate time required to implement this item
UC = unspecified costs
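Summing the direct dollar outlays in table 3, the EPP administration required roughly $7,000 + $47 + $41 + $499 = $7,587 in fees, supplies, and incentives, along with well over one hundred hours of staff and class time, whereas the principal direct costs of the rubric-based assessment were the $2,250 paid to the three Faculty Assessment Scholars and about twenty-five hours of staff analysis time.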

 

Discussion

Implementing the abbreviated version of the EPP was time-consuming and expensive, and it yielded results in which neither the UD FASs nor the OEA had great confidence. Because of the potential bias introduced by students self-selecting to take the standardized test, there was little assurance that the results could be generalized to the UD student population. Moreover, since the EPP is a low-stakes test and the abbreviated version provides a less intensive examination of skill levels than the more expensive and much longer standard version (40 minutes for the abbreviated test versus 120 minutes for the standard EPP), its results could not be validly extrapolated as a measure of UD students’ learning outcomes.

By contrast, using the VALUE rubrics allowed the creation of an assessment regimen far more closely attuned to the actual learning experiences of UD students. It was possible, for instance, to obtain authentic samples of student work with minimal interruption to class time, and this model of data collection is readily expandable. An even more thorough representation of student work could be obtained, for example, by implementing e-portfolios in the first-year experience course: because this course is required of all UD students, the faculty in charge could require every freshman to upload artifacts of their work so that a baseline measure could be obtained. A similar process could easily be implemented in senior capstone courses. Since almost all UD majors currently require a senior capstone experience, the artifacts collected would represent the widest breadth of majors and would consequently allow the OEA to compare first-year students’ and seniors’ Gen Ed competency levels. These methods of collecting student work would also eliminate the costs of enticing volunteers with candy and expensive prizes, and they would save the time now spent recruiting participants and administering standardized tests.

The use of a rubric, with its ability to isolate the criteria that make up each goal, provided more useful and actionable information than simple proficiency levels on large, nebulous skills such as critical thinking. For example, when the UD FASs evaluated critical thinking skills, they observed that students’ abilities to make assumptions and draw conclusions were their weakest skills. Results such as these allow the OEA to communicate to academic programs the need to provide students with further opportunities to develop these fundamental skills.

Good assessment should be cost-effective and should provide useful data with actionable results. By using the adaptable VALUE rubrics and ensuring that the descriptors closely match an institution’s own skill definitions, student learning can be assessed in a meaningful way. Students already complete graded assignments, and these are easily collected without intruding on class time: students can be directed to upload assignments to a secure website where assessment faculty and staff can review authentic student work. A systematic assessment process that establishes baseline data for freshmen and can also be used to examine specific Gen Ed goals provides results that faculty can use to positively affect student learning outcomes. Therefore, as institutions continue to face pressure from external accreditors, they need to involve faculty in a cost-effective assessment process that allows them to connect the assessment of Gen Ed to what is actually occurring in their courses and to disseminate results focused on improving student learning. UD’s experience using the VALUE rubrics and a team of Faculty Assessment Scholars created just this kind of involvement, and it can serve as a model of collaboration between academic faculty and the administrative offices engaged in the institutional work of measuring, and improving, student learning outcomes.

References

Beaupré, D., S. B. Nathan, and A. Kaplan. 2002. “Testing Our Schools: A Guide for Parents.” http://www.pbs.org/wgbh/pages/frontline/shows/schools/etc/guide.html (accessed August 17, 2011).

Kish, L. 1965. Survey Sampling. New York: John Wiley & Sons.

Spirtos, M., P. O’Mahony, and J. Malone. 2011. “Interrater Reliability of the Melbourne Assessment of Unilateral Upper Limb Function for Children with Hemiplegic Cerebral Palsy.” American Journal of Occupational Therapy 65 (4): 378–383. http://ajot.aotapress.net/content/65/4/378.full (accessed August 15, 2011).

Stattools.net. n.d. “Intraclass Correlation for Parametric Data: Introduction and Explanation.” http://www.stattools.net/ICC_Exp.php (accessed August 15, 2011).

Yaffee, R. A. 1998. “Enhancement of Reliability Analysis: Application of Intraclass Correlations with SPSS/Windows v. 8.” http://www.nyu.edu/its/statistics/Docs/intracls.html (accessed August 15, 2011).


Kathleen Langan Pusecker is the director of the Office of Educational Assessment; Manuel Roberto Torres is a research analyst in the Office of Educational Assessment; Iain Crawford is the acting chair of English and an associate professor; Delphis Levia is an associate professor of geography; Donald Lehman is an associate professor of medical technology; and Gordana Copic is a graduate student. All are at the University of Delaware.
