The Future of Testing
Given how much the rest of education has changed since the middle of the 20th century, it’s remarkable that the model of large-scale student assessment we have today still looks much as it did in the 1950s: a group of kids under careful watch, seated in rows in a rigidly controlled space, all asked the same questions, each silently bubbling in answer sheets under the same strict time limits.
To be sure, new technologies have been incorporated into standardized testing over the decades: machine scoring, computerized authoring and delivery, computer adaptive testing, technology-enhanced items, automated essay scoring, automated item generation. But these innovations — not all of them widespread; it’s still very much a paper-and-pencil world in most US schools — haven’t really changed the underlying testing paradigm. Whether computer- or paper-based, the tests still consist mostly of multiple-choice questions. They still require highly contrived and regimented conditions for administration. They still rely on the same measurement principles and techniques. They are still governed by the values of 20th-century industrialization: speed, uniformity, cost efficiency, quantifiability, and mass production.
This model of testing persists precisely because it so effectively delivers on machine-age promises of reliability and efficiency at a large scale. But these benefits come at a cost: the fixed content and rigid testing conditions severely constrain the skills and knowledge that can be assessed. A battery of multiple-choice and short-answer questions on a timed test may do a pretty good job of evaluating basic skills, but it falls short of measuring many of the most essential academic competencies: sustained engagement, invention, planning, collaboration, experimenting and revising, spit-shining a finished project — the kind of skills developed through authentic, substantive educational experiences.
Standardized testing has not kept up with advances in learning science. It ignores, for example, the non-cognitive skills that research today tells us are integral to learning, such as personal resilience or a willingness to cooperate. What’s more, we acknowledge today that students develop their academic competencies, cognitive and non-cognitive, in particular educational contexts in which their own varied interests, backgrounds, identities, and languages are brought to bear as valued resources. Conventional standardized tests work to neutralize the impact of these variables, rather than incorporate them.
We do need, and will continue to need, large-scale assessments, despite the many dissatisfactions we may have with them at present. Classroom assessment by itself doesn’t tell us what we need to know about student performance at the state or national level. Without large-scale assessment, we’re blind to differences among subgroups and regions, and thus cannot make fully informed decisions about who needs more help, where best to put resources, which efforts are working and which aren’t.
The central problem to address, then, is how to get an accurate assessment of a fuller range of authentic academic competencies in a way that is educative, timely, affordable, and scalable — a tall order indeed. Recognizing the limitations of the existing testing paradigm, the Every Student Succeeds Act (ESSA) of 2015 opened the door for a limited number of states to try out alternative models that might eventually replace existing accountability tests. Thanks in part to this opportunity, plus ever-advancing technologies, new ideas are in the works.
Here are some directions in which the future of testing may be headed:
Classroom-Based Evidence. The assessment of authentic classroom work can provide a fuller and more genuine portrait of student abilities than we get from the snapshot view afforded by timed multiple-choice-based tests. Indeed, portfolio assessments are widely used in a variety of contexts, from individual courses to district-level graduation requirements. Historically, however, they haven’t worked well at scale. Experiments with large-scale portfolio assessment in the 1990s were abandoned as they proved cumbersome and expensive, and as states found it difficult to establish comparability across schools and districts.
Hopes for using collections of authentic student evidence in large-scale assessments are being revived, however, as ESSA creates new opportunities for state-level change. The anti-standardized testing group, FairTest, has developed a model to help guide state system innovations toward local assessment of classroom-based evidence. The model folds teacher-evaluated, student-focused extended projects into a statewide accountability system with built-in checks for quality and comparability. FairTest cites programs already underway in New Hampshire and elsewhere as evidence of where this approach might lead.
The FairTest model doesn’t rely on new technologies, but large-scale portfolio assessment potentially becomes more feasible today, compared with the low-tech version in the nineties, thanks to easier digitization, cheaper storage, and ubiquitous connectivity. More than mere repositories for uploaded student work, platforms today can combine creation and social interaction spaces with advanced data analytics. This creates opportunities for assessing new constructs (research, or collaborative problem-solving, for example), gaining new insights into student competencies (e.g. social skills), and even automating some dimensions of portfolio assessment to make it faster and more affordable. Scholar, a social knowledge platform currently in use in higher education, provides a glimpse into the kind of environment in which large-scale e-portfolio assessment might someday take root.
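To make that idea a bit more concrete, here is a minimal, hypothetical sketch (in Python) of the kind of signal an e-portfolio platform could compute automatically from student work and peer activity. The Artifact record and the two metrics are illustrative assumptions for this post, not features of Scholar or any particular product.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical records illustrating how a portfolio platform might track
# student work alongside social activity (not any real product's data model).
@dataclass
class Artifact:
    title: str
    drafts: int                                              # saved revisions
    peer_reviews: List[int] = field(default_factory=list)    # rubric scores received


def revision_depth(portfolio: List[Artifact]) -> float:
    """Average number of drafts per artifact: a rough, automatable
    signal of sustained engagement and revising."""
    return sum(a.drafts for a in portfolio) / len(portfolio) if portfolio else 0.0


def peer_review_coverage(portfolio: List[Artifact]) -> float:
    """Share of artifacts that received at least one peer review:
    one small, reportable dimension of collaboration."""
    if not portfolio:
        return 0.0
    return sum(1 for a in portfolio if a.peer_reviews) / len(portfolio)


if __name__ == "__main__":
    work = [
        Artifact("Lab report", drafts=4, peer_reviews=[3, 4]),
        Artifact("Research essay", drafts=2),
    ]
    print("revision depth:", revision_depth(work))            # 3.0
    print("peer review coverage:", peer_review_coverage(work))  # 0.5
```

Simple indicators like these wouldn’t replace human judgment of portfolio quality, but they suggest how a platform could automate some dimensions of the assessment and keep the rest affordable.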
Real-World Fidelity. Another shortcoming of multiple-choice-based standardized tests is that they do not present students with authentic contexts in which to demonstrate their knowledge and skills. More authentic tasks, critics argue, better elicit the actual skills associated with the constructs measured, and thus lead to more valid test score interpretations.
Computer-based tests create opportunities for item types that more closely resemble real-world activities, compared with traditional multiple-choice questions. Technology-enhanced items (TEIs) can, for example, allow students to manipulate digital objects, highlight text, show their calculations, or respond to multimedia sources. While such items fall short of replicating real-world activities, they do represent a step beyond selecting an answer from a list and filling in a bubble sheet.
Many computer-based versions of standardized tests now add TEIs to the mix of conventional items in hopes of measuring a broader range of skills and improving test validity. In truth, however, TEIs bring their own set of test development challenges. Though eager to use them, test makers at this point do not know very much about what a given TEI might measure beyond a conventional multiple-choice question, if anything. Additionally, in their quest for greater real-world fidelity, TEIs can at the same time introduce a new layer of measurement interference, requiring examinees not only to demonstrate their academic ability, but also to master novel test item formats and response actions.
Despite their current limitations, however, technology-enhanced items will likely continue pushing standardized testing toward greater real-world fidelity, particularly as they grow more adept at simulating authentic problems and interactions, and better at providing test takers with opportunities to devise and exercise their own problem-solving strategies. The latest iteration of the PISA test, a large-scale international assessment, simulates student-to-student interaction to gauge test takers’ collaborative problem-solving skills. Future versions will connect real students with one another in real time.
Continuous Assessment. As tests evolve toward truer representations of real-world tasks, they will likely pick up a trick or two from computer-based games, such as Mars Generation One: Argubot Academy or Physics Playground. These games, like many others, immerse students in complex problem-solving activities. To the extent that conventional test makers likewise learn to engage students in absorbing tasks, they will better succeed at eliciting the kinds of performances that accurately reflect students’ capabilities. When tasks lack relevance and authenticity, they work against students’ ability to demonstrate their best work.
In addition to engaging their interest, computer-based educational games can continuously assess students’ performances without interrupting their learning. The games register a student’s success at accomplishing a task; but more than that, they can capture behind-the-scenes data that reveal, for example, how persistent or creative the student was in finding a solution.
As they develop, platforms delivering academic instruction might also automatically assess some dimensions of authentic student performance as it happens, without interrupting learning activities. The Assessment and Teaching of 21st Century Skills project, from the University of Melbourne, provides an example of how an academic platform can capture log stream and chat stream data to model and evaluate student activity. This kind of “stealth assessment” creates opportunities for including non-cognitive competencies — e.g., level of effort, willingness to contribute — in the overall picture of a student’s abilities.
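To illustrate what “stealth assessment” might look like under the hood, here is a small, hypothetical sketch of how indicators such as persistence or willingness to contribute could be derived from a platform’s event log. The Event record and both functions are illustrative assumptions, not the actual instrumentation of the Melbourne project or any particular game.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical event record standing in for the log and chat streams a
# learning platform might capture during a collaborative task.
@dataclass
class Event:
    student: str
    kind: str  # e.g., "attempt", "success", "chat"


def persistence(events: List[Event], student: str) -> int:
    """Crude proxy for persistence: the number of additional attempts a
    student logged after their first attempt at the task."""
    attempts = [e for e in events if e.student == student and e.kind == "attempt"]
    return max(len(attempts) - 1, 0)


def contribution_share(events: List[Event], student: str) -> float:
    """Fraction of chat messages contributed by the student: one rough
    indicator of willingness to contribute to the group."""
    chats = [e for e in events if e.kind == "chat"]
    if not chats:
        return 0.0
    return sum(1 for e in chats if e.student == student) / len(chats)


if __name__ == "__main__":
    log = [
        Event("ana", "attempt"), Event("ana", "chat"),
        Event("ben", "chat"), Event("ana", "attempt"),
        Event("ana", "success"), Event("ben", "chat"),
    ]
    print("persistence(ana):", persistence(log, "ana"))                        # 1
    print("contribution_share(ana):", round(contribution_share(log, "ana"), 2))  # 0.33
```

Real systems would of course use far richer behavioral models, but the point stands: the evidence is gathered as a by-product of the learning activity, not from a separate sitting.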
Inclusion. To achieve statistical reliability, conventional standardized tests demand rigorously uniform test-taker experiences. Accordingly, the tests have always had a hard time accommodating examinees with special needs. Education today, however, is moving steadily away from uniformity, toward greater inclusion and accommodation of the whole community of learners, including those with various physical, learning, and language differences.
Computer-based testing presents both opportunities and challenges for accessibility. On one hand, special tools, such as magnifiers and glosses, can be built into standard items. On the other, TEI formats that use color, interactivity, response actions requiring fine motor skills, and other features can be difficult or impossible for some test takers to use. Nevertheless, research suggests that, overall, the digital testing environment can improve access to testing for students with disabilities.
Among the challenges to inclusivity in US testing is the problem of evaluating students who are learning English against standards that assume they already have English language skills. According to Professor Alida Anderson of American University, this problem highlights the need for future assessment systems to be more flexible, not only in the design and delivery of test content, but also in the interpretation and use of standards. Towards that end, programs such as the New York State Bilingual Common Core Initiative are developing bilingual standards and learning progressions that align with English language-based standards frameworks. These efforts promise a fairer and more accurate interpretation of test results for more students.
My own company, BetterRhetor, is combining some of the innovations discussed above in an effort to overcome the limitations of conventional testing (see our GoFundMe). Our web-based platform, for use in classrooms, will deliver six-week instructional modules in Writing and STEM. Assessment of student performance is facilitated by the platform and integrated into instruction. The modules will teach, elicit, capture, and assess not only cognitive skills, but also social and personal competencies. Because students engage over an extended period, we’ll be able to supply actionable feedback, as well as indications of progress. Our overall goal is to provide teachers and schools with a highly effective instructional resource that generates a rich portrait of their students’ authentic abilities.
Innovations like these will likely require parallel innovations in measurement science if they are to take hold in large-scale assessment. Test reliability, for instance, might be reframed in terms of negotiated interpretations by panels of local stakeholders, instead of statistical correlations among test scores. Determinations of validity may need to consider how well a test elicits fair and authentic performances from the full complement of learners in an educational community. Comparability across schools and districts may need to take into account the degree to which an assessment supports not just institutional needs but also student learning.
Ideally, future forms of large-scale assessment will function as integral dimensions of learning itself, rather than interruptions or intrusions. They’ll both evaluate and reinforce the full array of knowledge and skills required for the successful completion of real academic work in real educational contexts.
Many thanks to Professor Alida Anderson, School of Education, American University, for her insights into inclusive testing.
For more, see:
- Performance Assessment in Different Learning Environments
- Scaling Formative Assessment: The How I Know Project
- 6 Tips for Creating Powerful Assessments for Your Students