Peer Review

Addressing the Assessment Paradox

The VALUE initiative's research can contribute to a revolution in assessment, or it can be used to justify the continuation of ineffective practices. The outcome will depend on leadership and on a willingness to critically analyze the assessment movement's inability to fulfill its promise: to measure learning so that we might improve it.

Assessment is a paradox: its practitioners want two things that are nearly incompatible. One is agreement among higher education stakeholders about claims like, “Our graduates can write at the college level.” Such agreement—when it can be reached—legitimizes the claims but also represents a compromise between multiple perspectives about what an outcome means in practice. At the same time, we want assessment measures that are grounded in empiricism, so we don’t fool ourselves. Andrew Gelman, who recently (2018) proposed ethical guidelines for statistical practice and communication, raises a similar point: “Consider this paradox: statistics is the science of uncertainty and variation, but data-based claims in the scientific literature tend to be stated deterministically (e.g. ‘We have discovered . . . the effect of X on Y is . . . hypothesis H is rejected’).”

Problems arise when the social meaning of assessment diverges from its empirical merits, as with the publication of Richard Arum and Josipa Roksa’s Academically Adrift: Limited Learning on College Campuses (2011). The book translated standardized test scores into a generalized conclusion, amplified across news outlets, that at the bachelor’s degree level, engineers can’t engineer and accountants can’t count.

On the other hand, socially constructed meaning that ignores reality is counterproductive. My favorite such story is from Behind the Urals: An American Worker in Russia's City of Steel, in which John Scott describes fixing tractors in the Soviet Union one weekend to help some farmers. After Scott used parts from twelve dilapidated tractors to assemble nine that worked, the farmers were horrified: officially they were accountable for twelve tractors (whether they worked or not was immaterial), and now three of them had vanished!

Institutionalized assessment of student learning resembles Soviet tractor counting. For an assessment director facing accreditation review, it is better to have twelve reports that conform to the reviewer’s (bureaucratic) expectations than to have a few good research projects (Eubanks 2017).

Assessment practice also fails at empiricism. What is typically accepted in assessment reviews has little to do with statistics and measurement. Nor could it be otherwise. The 2016 IPEDS data on four-year degrees granted show that half of academic programs (using CIP codes) had eight or fewer graduates that year. Such small samples are not suitable for measuring a program's quality, given the many sources of variation in student performance. By my calculations, fewer than 5 percent of four-year programs had enough graduates to make a serious measurement attempt worthwhile. It's safe to conclude that most of the 80,000+ bachelor's degree programs in the United States are not producing trustworthy statistics about student learning, regardless of the nominal value of their assessment reports.
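The small-sample problem can be made concrete with a quick back-of-the-envelope calculation. The sketch below (my own illustration, not part of the IPEDS analysis; the standard deviation of 0.8 on a 1–4 rubric scale is an assumed, plausible value) shows the approximate 95 percent margin of error for a program-level mean score at different cohort sizes:

```python
def margin_of_error(n, sd=0.8, z=1.96):
    """Approximate 95% margin of error for a sample mean:
    z * sd / sqrt(n), assuming roughly normal sampling of the mean."""
    return z * sd / n ** 0.5

# Hypothetical rubric scores on a 1-4 scale with an assumed sd of 0.8
for n in (8, 30, 100):
    print(f"n = {n:3d}: about ±{margin_of_error(n):.2f} rubric points")
```

With eight graduates, the uncertainty is roughly ±0.55 points on a four-point scale, wide enough to swallow almost any plausible year-to-year difference in program quality.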

The current situation is the worst possible outcome: social acceptability of sloppy data work, creating a decades-long failure to fulfill the assessment movement’s laudable empirical aims.

The Promise and Peril of AAC&U’s VALUE Rubrics

The VALUE rubrics and accompanying validity studies address both the empirical and social aspects of assessment. The learning outcomes and the language used to describe them came from discussions with stakeholders in a credible attempt to define types of student achievement that are not easily measured. The ongoing research on the use of these instruments explores the validity of the rubric ratings for understanding student achievement. As such, the project is commendable in trying to resolve the paradox described above.

Assessment practice can be moved beyond its current bureaucratic impasse by projects like VALUE, but it is also possible that such efforts could worsen the situation instead. Current assessment practice relies on the language of empiricism (e.g. “measuring student learning,” setting benchmarks, looking for statistical differences, etc.) for credibility, even though the conditions do not exist for actual empiricism to be broadly employed. The research done in the VALUE initiative could end up just adding a gloss of science to otherwise shoddy data work done in the name of program assessment (“My measure of program quality is valid because we used a VALUE rubric, even though it’s based on a single student’s performance.”). To have full effect, research like VALUE requires a revolution within assessment practice.

The Future of Assessment

Assessment as a field should take note of the multiple "replication crises" ongoing in other disciplines, where a significant amount of published work is being called into question. Often this is due to too-small data sets combined with publishing conventions that relied on nominal, rather than actual, significance. Gelman's proposed ethical principles in statistical communication are directed in part at this situation. I believe that the application of his guidelines in assessment could revolutionize practice and satisfactorily resolve the paradox of social-versus-technical meaning. We can have both.

In order to "move away from systematic overconfidence," as Gelman puts it, he recommends more transparency in data work. Researchers of student achievement need access to the raw data and methods of analysis for studies. To use a cliché, the VALUE initiative can be a "guide on the side" by leading research into the psychometric properties of its rubrics while granting others access to compare to their own work. How do our internal inter-rater agreement statistics for first-year writing compare to external ones? What is a typical growth curve for undergraduate writing over four years for low-GPA students? These sorts of questions have answers that come from collective pools of data, and VALUE is a natural hub for such a project.
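An inter-rater agreement comparison of the kind suggested above requires a common statistic. One standard choice is Cohen's kappa, which corrects raw percent agreement for the agreement expected by chance. A minimal sketch, using made-up ratings from two hypothetical raters scoring the same ten artifacts on a 0–4 rubric scale:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal category rates
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings from two raters
a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]
print(round(cohens_kappa(a, b), 2))  # → 0.58
```

Publishing such statistics alongside the raw ratings would let one program's agreement levels be compared directly with another's, which is exactly the kind of transparency Gelman calls for.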

Gelman also addresses the limitations of statistical ways of knowing, and he recommends a culture that embraces criticism of results and methods. The irony of the assessment movement is that it has become fixed and unresponsive to criticism, as public exchanges in the last year have shown. In contrast, in their Change magazine article on assignment difficulty, Daniel F. Sullivan and Kate Drezek McConnell ask the critical question of their assessment data, “Why aren’t the scores of seniors much higher?” (2017). By working through this challenge to validity, they find something interesting and useful about assignment difficulty.

Current assessment practices enforce an unworkable model of too many projects with too little data and methods that practically ensure poor results. The VALUE initiative can partly address the small-data issue through targeted research projects and in support of hermeneutic ways of knowing; see Pamela Moss’s (1994) work on that subject as a guide. However, we still need larger data sets to produce generalizable hypotheses about student learning. To make progress there we need to reboot assessment’s empirical expectations, eliminate the outdated rules, and seek new methods of data gathering that can address both the technical and social requirements.

Assessment in higher education can still fulfill its original promise. But we need to reflect critically on the historical ineffectiveness of the movement in comparison to the ubiquitous success of data mining in other contexts. Piecemeal solutions can't patch up the flaws; we need a complete rethink. The VALUE approach to assessment can be avant-garde in this revolution. It would be a shame if instead it just becomes another checkbox on an assessment report.


Arum, Richard, and Josipa Roksa. 2011. Academically Adrift: Limited Learning on College Campuses. Chicago: University of Chicago Press.

Eubanks, David. 2017. “A Guide for the Perplexed.” Intersection of Assessment and Learning.

Gelman, Andrew. 2018. “Ethics in Statistical Practice and Communications.” Significance 15 (5): 40−43.

Scott, John. 1989. Behind the Urals: An American Worker in Russia's City of Steel. Bloomington, IN: Indiana University Press.

Sullivan, Daniel F., and Kate Drezek McConnell. 2017. “Big Progress in Authentic Assessment, But by Itself Not Enough.” Change: The Magazine of Higher Learning 49 (1): 14−25.

David Eubanks, Assistant Vice President, Office of Institutional Assessment and Research, Furman University
