Where is the Accountability in Education?

Learning, PreK-12

For too long, we’ve held only teachers, and “teacher quality,” accountable for student outcomes. The spotlight of student results has been focused intensely on teachers alone, despite compelling evidence that instructional content and programs are almost as important to student learning. (In a scathing white paper, “Choosing Blindly: Instructional Materials, Teacher Effectiveness, and the Common Core,” Matthew Chingos and Grover Whitehurst blast both the programs themselves and the lack of training and development around them.)

Not only is it past time to hold content and programs accountable, but we are out of excuses not to. A recent study published by WestEd demonstrates that cost-effective and rigorous evaluations of new programs can now be pursued at any time in any state.

The diversity of content – scope, vehicles, approaches and instructional design – available today is far greater than when teacher selection committees simply chose among the “Big 3” publishers’ textbooks. Spurred by higher standards, we are aiming as a nation to transform student learning outcomes to be much deeper than in the past. Yet the only things that are actually going to change are content and programs: the power of the tools and training we provide to our teachers.

Technology-driven revolutions happen when people readily adopt new, immensely more powerful tools to get work they wanted to get done, done. This came naturally with printed books, spreadsheets, email and smartphones. But in education, it has been extremely challenging to determine what tools actually work, in what contexts and to what ends. The gigantic variability in tool use and school culture has led, understandably, to skepticism about replicating anecdotal results. What we need instead are credible evaluations, diving as deep as randomized student-level trials. But those are complex, logistically challenging, high-cost, and notoriously sparse and slow.

Now that every state has a testing system in grades 3-8, there is consistent grade-level information about proficiency rates, and some ability to measure growth rates. That enables any content or program to demonstrate its ability to add value shortly after state assessment results are released each year.

This grade-level evaluation method is straightforward and replicable across years, states and program-types. It also works for every user (school site) in a state, taking into account all real-world variability, easily reporting out on hundreds of schools and tens of thousands of students.

To use the method, the program must be:

  1. In a grade (e.g., 3rd-8th) and subject (e.g., math) for which public grade-average test results are posted
  2. A full curriculum program (so that summative assessments are valid)
  3. In use at 100% of the classrooms/teachers in each grade (so that the public grade-average assessment numbers are valid)
  4. New to the grade, within the first year or two of adoption
  5. Adopted at about 25 or more school sites within a state (in order to provide sufficient “n”).
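As a rough illustration, the five criteria can be expressed as a simple eligibility filter over adoption records. A minimal sketch, assuming a hypothetical `Adoption` record layout (these field names and thresholds are not an actual state data schema):

```python
from dataclasses import dataclass

@dataclass
class Adoption:
    """One program adoption at one school site (hypothetical record layout)."""
    grade: int                 # grade level of the adoption
    subject: str               # e.g., "math"
    is_full_curriculum: bool   # criterion 2: full curriculum program
    classroom_coverage: float  # fraction of the grade's classrooms using it
    years_since_adoption: int  # criterion 4: how new the program is

def eligible_site(a: Adoption) -> bool:
    """Check criteria 1-4 for a single school site."""
    return (
        3 <= a.grade <= 8                  # 1: tested grade span
        and a.subject == "math"            # 1: publicly reported subject
        and a.is_full_curriculum           # 2: summative assessment is valid
        and a.classroom_coverage == 1.0    # 3: 100% of the grade's classrooms
        and a.years_since_adoption <= 2    # 4: new to the grade
    )

def study_feasible(adoptions: list[Adoption], min_sites: int = 25) -> bool:
    """Criterion 5: enough eligible sites in the state for sufficient n."""
    return sum(eligible_site(a) for a in adoptions) >= min_sites
```

In practice the real screen would run against state adoption and assessment records; the sketch only shows how mechanical the eligibility check can be once those records exist.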

When these conditions are met, a study that meets What Works Clearinghouse standards of rigor is possible without prior planning, as this WestEd study of ST Math results shows. Every program, in every state, every year, that meets the above criteria can be studied, whether for the first time or for replication. The data are waiting in state and NCES research files, ready to be used in conjunction with complete and accurate records of program usage.

It may be too early for this level of accountability to be palatable to many publishers. Showing robust, positive results will require true program efficacy. There will be many findings of small effect sizes, many implementations that fail, and much lack of statistical significance. Third-party factors may confound the results. Publishers would need to report failures as well as successes. But the alternative is to continue relying on peer word-of-mouth recommendations.

When this research method becomes the industry norm, imagine the renewed competition to improve the tools we give our teachers. We must hold ourselves responsible for imposing accountability in education. Only then will we have a real education revolution.

LEAP Innovations, a Chicago spinout of New Schools for Chicago, is an innovative nonprofit sponsoring trials of new EdTech tools, and a great example of well-constructed trials for new products.


Andrew Coulson

Andrew R. Coulson has led the MIND Research Institute’s Education Division since 2001, helping devise and execute the strategies and programs to increase the Division’s reach from 30 schools in Southern California to over 1300 in 25 states. Andrew blogs often at ArcSparks.


Steven Hodas

Andrew, it seems to me that fewer and fewer programs will meet all five of the requirements you cite. Decisions are increasingly being made teacher by teacher and they’re increasingly choosing portfolios of complementary instructional supports rather than single basal programs. Where they are choosing full curriculum programs they’re still deploying all kinds of other supports in ways that would probably be confounding, as you suggest regarding third-party factors.

It would be great to see research skating towards this diverse, third-party-rich school/classroom ecosystem, incorporating the traits of peer recommendations that make them so powerful for practitioners.

Andrew R. Coulson

Steven, thanks for the great comment. It stimulates a longish response! I am arguing for the usefulness of this method, which can provide cost-effective feedback and accountability, because I believe there will be plenty of opportunities to use it now and over the long term; I’m not trying to argue for or against any particular program (and a teacher-by-teacher portfolio is a sort of meta-program).

Yes, there exists a rich instructional-materials ecosystem, and I recognize the teacher-driven portfolio vision. It’s my strong belief that sufficiently knowledgeable, skilled, and motivated teachers could get excellent results with nothing but their voice, a stick, and a wet sand beach for materials. I was fortunate to go through California’s public school system in the pre-Prop 13 days, with the very best teachers that a wealthy district could attract to its gifted-track student cohort. Standardized assessments (~1 ITBS per year) and textbook choice were rendered irrelevant; teacher-created syllabi, lessons and instructional materials (ah, mimeographs…) were abundant. I’m sure that today, those teachers of mine would also be sourcing, effectively using, and coherently integrating amazing portfolios from a wide array of free digital content.

But I need to address your comment, now in 2015, from two specific perspectives: 1) I am always filtering for what is achievable at universal scale across the U.S., in every classroom, and 2) I am coming from a perspective of math education.

With respect to the latter: it’s a perspective born from 15 years of focus on math, the most interrelated, crucial-building-blocks, abstract-conceptual subject ever made a 10-years-in-a-row requirement for every student. As such, it is less amenable to customization by the teacher, in my opinion, than, say, choosing the theme or exemplars for a history, literacy, or even science lesson. Ten years of math lends itself to a potentially consistent approach, strategy, pedagogy, even format across grades and thus across teachers. In contrast, a marked change in teacher-portfolio-driven math content, pedagogical approach, and even math learning goals and values (is it “get the answer” or “understand the problem”?) from, say, 5th grade math to 6th grade math has much potential downside for the average student, in both switching costs and confusion.

The former perspective, filtering for universal scalability, means that I’m interested in what will work for 80%+ of students in 80%+ of classrooms across the U.S.: large and small, urban and rural, homogeneous and inclusionary, new and old. From that perspective, it is, i.m.o., just too many hours to expect 80%+ of teachers to successfully navigate, execute with quality, evaluate with results, and incorporate each individual student’s feedback into learning improvement if the digital content is teacher-sourced, teacher-integrated, teacher-lesson-planned, and teacher-self-evaluated, without, e.g., a consistent digital ‘back-end’ to integrate student learning patterns across the school year at the individual level. It’s just too much load to place on every teacher’s back, i.m.o.

In addition, the PLC vision of, say, teachers across a grade sharing their findings and tips about what works, how, and for whom under what conditions, in a cycle of continuous improvement, is greatly synergized by common content, approaches, and tools being used across the grade (my criterion #3 for using this evaluation method). And, as discussed at a large local district’s elementary and research office I visited just this morning, the added value a district seeks in providing professional development that is aware of the content and pedagogy in use also requires some commonality of content and approach across teachers. I see many districts moving away from “we don’t know what programs they’re using out there, but at last count there were over 100” to “we’re down-selecting to those few programs that show they can work for our teachers and students.”

So an option within the future ecosystem which has commonality across teachers within a grade, and can be supported by a publisher with industrial-strength tools, is going to be a reality for, I think, no small fraction of classrooms. Nothing in the common-across-a-grade criterion prevents any individual teacher from supplementing, enriching, or even redacting at will! I’m sure that in the WestEd study across 212 schools referenced, nearly every teacher did to some extent. That’s one reason (criterion 5) I’m suggesting >=25 sites for “n”.

Perhaps this clarification of how I see things is moot, in that, yes, as you point out, there is a shift from a single-basal-textbook mentality to a full curriculum program accompanied by many other supports. But any given component of that mix can still be evaluated by this method – if it’s used across a grade level. Khan Academy/flipped classroom could be evaluated. CGI could be evaluated. ST Math is itself just one of several supports to a basal, but was readily evaluable.

Perhaps an evaluation comparing school cohorts that (a) adopted teacher-by-teacher portfolio selection (for all teachers in a grade) with cohorts that (b) kept a more uniform ‘business as usual’ (most likely an adopted, across-teachers basal text plus standard digital content and supports) would be insightful. Though pretty “meta,” this could still meet the five criteria and use the method shared in the post.