Peer Review

The Role of Assignments in the Multi-State Collaborative: Lessons Learned from a Master Chef

Perhaps it is due to my distant British heritage, which often manifests as a cynical and dry sense of humor, but for some reason I have recently come to appreciate the work of Chef Gordon Ramsay. Chef Ramsay, who is arguably one of the best culinarians in the world, has held numerous competitions on television in which participants aim to win a prized position as head chef in one of his famous restaurants. On his shows, contestants are often asked to perform a variety of tasks over the course of several weeks. The tasks include such challenges as preparing a signature dish, blind taste tests, reproducing a meal, and turning “leftovers” into “fine-dining” cuisine. The competition can be brutal and Chef Ramsay is a difficult critic to impress. Although I find his antics generally amusing, and at times controversial, I believe we have much to learn from Chef Ramsay. There are striking parallels between the issues facing his assessment strategy and our efforts to examine student learning in higher education.

I also tend to be a critic. For example, I have been framed within the Chronicle of Higher Education (Berrett 2015; Berrett 2016) as a “supportive skeptic” of the Multi-State Collaborative (MSC). Like most academics, I enjoy the realm of skepticism. We like to question, debate, and critique the quality of evidence or reasoning used to support claims. In fact, I would argue that improvement is at best difficult, and at worst impossible, without some form of skeptical critique. Therefore, embracing skepticism is not our challenge in higher education; instead, we face the challenge of providing an adequate response to the skeptics. We have the task of creating an assessment system in higher education that can reasonably withstand this form of scrutiny. I believe that the MSC project is a positive step in this direction.

The MSC aims to provide a common language of student learning by assessing products that students create as a part of their curricular requirements. To date, ninety-two institutions across thirteen states have participated in the project. To participate, institutions submitted sample student work for scoring by raters who were trained to use at least one of the VALUE rubrics. For several years, my MSC colleagues and I have conducted research on the importance of intentional assignment design in assessment initiatives like the MSC. This research centers on two themes: the problem of assignment misalignment, and the tension between the competing values of generalization and directness when assessing performance. I aim to provide an intuitive overview of these issues by drawing parallels between the processes employed by Chef Ramsay to evaluate the quality of chefs and our efforts to assess student learning in higher education.

Chefs Use What Is Available in the Kitchen

Like Chef Ramsay, who assigns tasks that allow him to make distinctions regarding the abilities of chefs, we too must make decisions about the best way to solicit performance so that students have an adequate opportunity to demonstrate what they have learned. Variations of a task sometimes referred to as a “black box challenge” occur in numerous cooking competitions. The black box challenge asks participants to cook a meal from a set of prespecified, surprise ingredients. For example, participants may be told that they have to use a specific protein, vegetable, and fruit when preparing an appetizer of their choice. Since all participants are provided with the same ingredients, we can infer that differences in the quality of appetizers are due to differences in their culinary ability. Judgments about their ability would be problematic if each chef were given a protein of different quality. This is analogous to having students complete an assignment by responding to the same, or theoretically exchangeable, prompts.

The black box challenge raises two important topics relevant to the MSC project. First, Chef Ramsay is justified when critiquing a competitor who did not use one of the specified ingredients. That mistake is often detrimental to a contestant. The contestant had an opportunity to demonstrate a skill and simply failed to do so. This situation is similar to students who do not display a skill when an assignment explicitly asked them to do so. Second, Ramsay would not be justified in critiquing a chef for something irrelevant, such as failing to include fish in the appetizer if the task clearly called for participants to use chicken. In other words, there should be alignment between the task and the criteria used to distinguish quality. The problem of misalignment between assignments and assessment in higher education was reinforced by recent research I conducted with my former graduate student, Nikole Gregg. Our findings demonstrated that institutions participating in the MSC project need to disentangle whether a score of “zero” is a function of the student or of the assignment.

The VALUE rubrics have scores that range from one to four. However, the scores given to student work products actually range from zero to four. According to On Solid Ground, a report on the nationwide use of VALUE rubrics, “Scorers . . . assign a ‘zero’ score if the work product does not show evidence of any of the four levels of proficiency for the dimension in question” (McConnell and Rhodes 2017, 9). Can we attribute the zero score to a characteristic of the student, or is it instead a function of the assignment? Nikole and I examined a series of measurement models using deidentified data obtained from AAC&U in order to investigate how the raters were using the written communication, critical thinking, and quantitative literacy rubrics. The results were amazingly clear. Simply put, the data did not “behave properly” when zeros were included in the analysis. Once we removed the zeros, the “picture” of the data aligned with what we would expect if the raters were using the rubrics properly. This story was similar for each of the three rubrics we analyzed.

To further illustrate this issue, consider an assignment in which a student was asked to write a hypothetical letter to an editor of a college newspaper about a social issue on campus. Assume that the student received a score of zero for the “sources and evidence” element of the written communication rubric. A zero may reflect that the student was asked to provide this information and failed to do so. In this case, the zero can be meaningfully applied to subsequent analyses since it reflects something about student proficiency. However, some assignments may not call for the student to evidence a specific rubric element, thus making the zero a characteristic of the assignment as opposed to the student. In this latter situation, the zeros should be deleted from subsequent analyses since they do not provide meaningful information about the student. Even so, these zeros provide vital information about assignment misalignment, which can be used as a faculty development opportunity. For example, assignment development workshops can be held with interested faculty in which participants learn strategies for creating tasks that are capable of soliciting evidence of each rubric element (e.g., Crosson and Orcutt 2014).
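To make this decision rule concrete, here is a minimal sketch of separating the two kinds of zeros before analysis. The data, field names, and scores are hypothetical illustrations, not the MSC's actual records or pipeline; the only assumption is that each scored work product can be tagged with whether its assignment actually called for the rubric element.

```python
# Hypothetical rubric scores; each record notes whether the assignment
# actually asked students to demonstrate the rubric element in question.
records = [
    {"student": "A", "score": 3, "element_required": True},
    {"student": "B", "score": 0, "element_required": True},   # student failed to show the skill
    {"student": "C", "score": 0, "element_required": False},  # assignment never asked for it
    {"student": "D", "score": 2, "element_required": True},
]

# Keep zeros only when the assignment called for the element; a zero on a
# task that never solicited the skill says nothing about the student.
analyzable = [r for r in records
              if r["score"] > 0 or r["element_required"]]

# Zeros on misaligned assignments are still useful -- as feedback for
# assignment-design workshops, not as student data.
misaligned = [r for r in records
              if r["score"] == 0 and not r["element_required"]]

print([r["student"] for r in analyzable])   # ['A', 'B', 'D']
print([r["student"] for r in misaligned])   # ['C']
```

Student B's zero survives into the analysis because it reflects proficiency; Student C's is routed to faculty development instead.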

In sum, there should obviously be alignment between criteria and tasks. Institutions using data from the VALUE rubrics should make sure that the zero scores are a function of the student and not the assignment. Scores that are a function of the assignment should be deleted before subsequent analyses, though this can also serve as a potential faculty development opportunity. Lastly, just as the quality of food is influenced by the ingredients available in the kitchen, so too are student products influenced by characteristics of the task they are given. The best chef in the world cannot elicit the taste of a filet mignon from a chuck steak.

Even the Best Chefs Lose

Each winner of a cooking competition seems to do poorly on some of the challenges. In an ideal world, a world in which our judgments about a chef’s proficiency were clear, we would have perfect consistency of performance across each of the challenges. If Chef A were the best in the first challenge, then she would consistently be the best across all challenges. Unfortunately, this tidy picture is far from reality. Inconsistencies happen and are in many respects an inherent aspect of measurement. However, we hope that performances are not wildly inconsistent. Imagine the confusion of Chef Ramsay if he were faced with contestants who were perfectly inconsistent (i.e., the rank order of chefs completely changed with each task). Perfect inconsistency would make it impossible to select the best chef. Thankfully, we do not inhabit a world of complete randomness.

The degree of inconsistency that is present is important because it reflects the amount of uncertainty that exists when generalizing from a series of limited performances to an overall metric of ability. Generalization, defined here as our capacity to draw overall conclusions about ability from a limited number of observations, tends to come at the cost of a competing value. Generalization in performance assessment is usually inversely related to directness (or what some people refer to as “authenticity”). To illustrate this point, assume we are solely interested in a person’s ability to cook. We need to make decisions about the best way to solicit evidence of their ability as a cook. One option might be to administer a sixty-item test in which they were asked a series of questions about how to prepare and cook food. An alternative strategy would be to actually observe them preparing a meal, which would then be scored using some kind of rubric or checklist.

Every item on a test is treated as a mini-observation; thus, in this example, we have sixty observations with the test compared to a single observation with the alternative strategy. Generalizations are easier to make with more observations, which in this scenario favors the test. However, the test sacrifices directness since the evidence it provides is further removed from the ability or skill we are ultimately interested in measuring than the alternative strategy. Observing individuals cook a meal has the benefit of being more direct than the sixty-item test, but we end up sacrificing our ability to make generalizations since it is unlikely that I will have the time and resources to observe people cooking a variety of foods across multiple contexts.

The topic of generalization has been notoriously problematic in the performance assessment literature (e.g., Lane and Stone 2006). I may therefore have “good” data about a student’s ability to cook a single dish, but I am unsure about how well they would cook other foods. This problem applies to outcomes that we tend to care about in higher education, such as written communication. We often observe a single performance (e.g., a paper) due to limited time and resources. However, even if we made multiple observations, there is a related problem referred to as task specificity. Our judgments about which students are doing better tend to change across multiple tasks designed to measure the same thing. I may think Students 1, 2, and 3 are my best chefs when asked to prepare a rib-eye steak, but I may come to very different conclusions if they prepare shellfish instead. Without actually making these observations, it is very difficult to determine if I would come to the same conclusions about student learning had other choices been made about what tasks to sample. In sum, alternative assessment strategies similar to those advanced by VALUE have the advantage of being more direct than many other strategies, though the directness tends to come at the expense of generalization. So how shall we proceed given these competing values?
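Task specificity can be quantified: a rank correlation between scores on two tasks shows how much the ordering of students changes from one task to the next. The sketch below uses invented scores for five hypothetical "chefs" and a from-scratch Spearman correlation (the classic d-squared formula, assuming no tied scores); it is an illustration of the idea, not an analysis of MSC data.

```python
def ranks(scores):
    """Rank scores from best (1) to worst; ties are not handled, for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2-1)), no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

steak     = [4, 3, 2, 1, 2.5]  # rubric scores for five chefs on the rib-eye task
shellfish = [1, 2, 4, 3, 2.5]  # the same five chefs on the shellfish task

print(round(spearman(steak, shellfish), 2))  # -0.9
```

A correlation near 1 would mean the rank order barely changes across tasks (generalization is safe); a value near zero or below, as in this contrived example, means conclusions about who is "best" depend heavily on which task happened to be sampled.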

Create Your Signature Dish

I have given much thought to these issues in the past few years. As far as I can tell, there are three possibilities for handling the problem of generalization in performance assessment. These possibilities include (1) increasing the number of observations, (2) restricting the domain of generalization, and (3) inferring what is “possible” instead of what is “typical.” The first option is perhaps the most intuitive. We would never attempt to estimate a chef’s overall cooking ability after they prepared a single appetizer, so why would we attempt to estimate a student’s written communication skills from a single paper? If we sampled more papers, we would get a better sense of their written communication. But how many observations are necessary to obtain decent estimates? Previous research suggests that we need anywhere from ten to fifteen observations per student (e.g., Hathcoat and Penn 2012). This leads me to conclude that the first option is unrealistic.
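The ten-to-fifteen figure can be made intuitive with the Spearman-Brown prophecy formula, which projects how score reliability grows as comparable tasks are added and scores are averaged. The single-task reliability of 0.25 below is an illustrative assumption chosen for this sketch, not a figure from the cited study:

```python
def spearman_brown(r_single, k):
    """Projected reliability when averaging scores over k comparable tasks."""
    return k * r_single / (1 + (k - 1) * r_single)

r1 = 0.25      # assumed reliability of a single paper/task (illustrative)
target = 0.80  # conventional threshold for dependable scores

# Smallest number of tasks needed to reach the target reliability.
k = 1
while spearman_brown(r1, k) < target:
    k += 1
print(k)  # 12
```

Under this assumption, roughly a dozen work products per student would be needed before overall conclusions become defensible, which is why sampling one paper per student falls so far short of supporting generalization.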

A second possibility is to restrict the domain of generalization. Instead of asking about the number of observations that are needed for generalization, we would now also consider the type of observation that is needed. With respect to the chef example, we may decide to restrict our inference to something more specific, such as an ability to prepare a particular cuisine like Thai, Greek, or Indian. Thus our “domain of generalization” has essentially become smaller by focusing on a specific style of food. Similarly, students may be able to write in one genre but not another (O’Neill and Murphy 2012). With this knowledge, we would be able to exert some controls during our sampling process by obtaining written work from a specific genre, thus making our inferences more limited but specific. This option is more feasible than the first, but it will require us to conduct additional research to better understand the boundaries of generalization since this will likely be different for each learning outcome.

The third option is not free of problems, but it is perhaps the most reasonable solution for the time being. When hiring a head chef, it is critical to get a sense of what is typical for them, which requires multiple observations. But do we need an estimate of what is typical with respect to our learning outcomes in higher education? Perhaps. But perhaps not. What do we learn about a chef when they are asked to create a signature dish? The signature dish does not provide information about what a chef will typically produce. Instead, the signature dish is an indication of what is possible by providing insight into what a chef is capable of creating. In higher education, senior capstone projects and specific forms of ePortfolios are analogous to a chef’s signature dish since the product illustrates what a student can create. In other words, the issue of generalization virtually dissipates once we change our focus from what is typical to what is possible.


Assessment in higher education faces issues similar to those encountered when attempting to distinguish a good chef from a not-so-good chef in a cooking competition. The product created by a chef is restricted by the tools and ingredients available in the kitchen. The products created by students are likewise influenced by assignment characteristics. Just as it would be unfair to critique a chef using criteria that fail to align with the task, so too should we avoid assigning numbers to students’ products when they were not given an opportunity to demonstrate evidence of a particular rubric element. Lastly, when assessing student performance, there tends to be tension between the competing values of generalization and directness. There is no easy solution to this issue, though I remain optimistic about our ability to confront this problem. If we wish to generalize, then additional research is needed to better understand the boundaries within which this is possible. However, we may also opt to investigate possibilities as opposed to generalities by sampling signature dishes.


Berrett, Dan. 2015. “Faculty See Promise in Unified Way to Measure Student Learning.” Chronicle of Higher Education, September 25, 2015.

Berrett, Dan. 2016. “The Next Great Hope for Measuring Student Learning.” Chronicle of Higher Education, October 16, 2016.

Crosson, Pat, and Bonnie Orcutt. 2014. “A Massachusetts and Multi-State Approach to Statewide Assessment of Student Learning.” Change: The Magazine of Higher Learning 46 (3): 24–33.

Hathcoat, John D., and Jeremy D. Penn. 2012. “Generalizability of Student Writing across Multiple Tasks: A Challenge for Authentic Assessment.” Research & Practice in Assessment 7 (Winter 2012): 16−28.

Lane, Suzanne, and Clement A. Stone. 2006. “Performance Assessment.” In Educational Measurement, 4th ed., edited by Robert L. Brennan. Westport, CT: American Council on Education/Praeger.

McConnell, Kathryne Drezek, and Terrel L. Rhodes. 2017. On Solid Ground: VALUE Report 2017. Washington, DC: Association of American Colleges and Universities.

O’Neill, Peggy, and Sandra Murphy. 2012. “Postsecondary Writing Assessment.” In Handbook on Measurement, Assessment, and Evaluation in Higher Education, edited by Charles Secolsky and D. Brian Denison, 405−422. New York: Routledge.

John D. Hathcoat, Assistant Professor of Graduate Psychology, James Madison University
