Should you trust a computer to grade your child’s writing on Common Core tests?

From The Washington Post:

Education activists are increasingly concerned about the computer grading of the written portions of new Common Core tests. Can a computer really grade written work as well as a human being?

. . . .

On April 5, 2016, the Parent Coalition for Student Privacy, Parents Across America, Network for Public Education, FairTest and many state and local parent groups sent a letter to the Education Commissioners in the states using the federally funded Common Core tests known as PARCC and SBAC, asking about the scoring of these exams.

We asked them the following questions:

  • What percentage of the ELA exams in our state are being scored by machines this year, and how many of these exams will then be re-scored by a human being?
  • What happens if the machine score varies significantly from the score given by the human being?
  • Will parents have the opportunity to learn whether their children’s ELA exam was scored by a human being or a machine?
  • Will you provide the “proof of concept” or efficacy studies promised months ago by Pearson in the case of PARCC, and AIR in the case of SBAC, and cited in the contracts as attesting to the validity and reliability of the machine-scoring method being used?
  • Will you provide any independent research that provides evidence of the reliability of this method, and preferably studies published in peer-reviewed journals?

Our concerns had been prompted by seeing the standard contracts that Pearson and AIR had signed with states. The standard PARCC contract indicates that this year, Pearson would score two-thirds of the students' writing responses by computer, with only 10 percent of these rechecked by a human being. In 2017, the contract said, all of the PARCC writing samples were to be scored by machine, with only 10 percent rechecked by hand.

. . . .

Another Pearson page linked from the FAQ, called "Scoring the PARCC Test," goes on at great length about the training and experience levels of the individuals selected to score these exams (a claim that is itself quite debatable) without even mentioning the possibility of computer scoring. In fact, we could find no page on the PARCC website that a parent would be likely to visit that makes clear that machine scoring will be used for the majority of students' writing on these exams.

. . . .

According to Les Perelman, retired director of a writing program at MIT and an expert on computer scoring, the PARCC/Pearson study is particularly suspect because its principal authors were the lead developers for the ETS and Pearson scoring programs. Perelman said: "It is a case of the foxes guarding the hen house. The people conducting the study have a powerful financial interest in showing that computers can grade papers."

. . . .

“Like previous studies, the report neglects to give the most crucial statistics: when there is a discrepancy between the machine and the human reader and the essay is adjudicated, what percentage of instances is the machine right? What percentage of instances is the human right? What percentage of instances are both wrong? … If the human is correct most of the time, the machine does not really increase accuracy as claimed.”
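
To see why that breakdown matters, consider a small worked example. The numbers below are invented purely for illustration; no vendor has released the underlying adjudication data. For each essay where the machine and the first human reader disagreed, a resolution reader assigns a final score, and we can then tally who was closer to it:

```python
# Hypothetical adjudication records: (machine_score, human_score, final_score)
# for essays where the machine and the first human reader disagreed.
# These numbers are invented for illustration only.
disagreements = [
    (4, 2, 2), (3, 1, 1), (5, 3, 3), (2, 4, 4), (3, 2, 2),
    (4, 3, 3), (2, 3, 3), (2, 4, 2), (5, 3, 5), (4, 2, 3),
]

n = len(disagreements)
machine_right = sum(m == a for m, h, a in disagreements)
human_right = sum(h == a for m, h, a in disagreements)
both_wrong = sum(m != a and h != a for m, h, a in disagreements)

print(f"machine matched the final score: {machine_right / n:.0%}")  # 20%
print(f"human matched the final score:   {human_right / n:.0%}")    # 70%
print(f"neither matched:                 {both_wrong / n:.0%}")     # 10%
```

On a split like this, the human reader is the one doing most of the correcting, which is Perelman's point: the machine's headline agreement rate, by itself, does not show that it adds accuracy.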

. . . .

Moreover, the AIR executive summary admits that “optimal gaming strategies” raised the score of otherwise low-scoring responses by a significant amount. The study then concludes that because one computer scoring program was not fooled by the most basic gaming strategy, repeating parts of the essay over and over, computers can be made immune to gaming. The Pearson study doesn't mention gaming at all.

Indeed, research shows it is easy to game these systems by writing nonsensical long essays with abstruse vocabulary. See, for example, this gibberish-filled prose, which received the highest score from the GRE computer scoring program. The essay was composed by the BABEL generator, an automatic writing machine that generates gobbledygook, invented by Les Perelman and colleagues.

. . . .

In a Boston Globe opinion piece, Perelman describes how he tested another automated scoring system, IntelliMetric, which was similarly unable to distinguish coherent prose from nonsense and awarded high scores to essays containing the following phrases:

“According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.”

Unable to analyze meaning, narrative, or argument, computer scoring instead relies on length, grammar, and arcane vocabulary to assess prose. Perelman asked Pearson if he could test its computer scoring program, but was denied access. Perelman concluded:

If PARCC does not insist that Pearson allow researchers access to its robo-grader and release all raw numerical data on the scoring, then Massachusetts should withdraw from the consortium. No pharmaceutical company is allowed to conduct medical tests in secret or deny legitimate investigators access. The FDA and independent investigators are always involved. Indeed, even toasters have more oversight than high stakes educational tests.

A paper dated March 2013 from the Educational Testing Service (one of the SBAC sub-contractors) concluded:

Current automated essay-scoring systems cannot directly assess some of the more cognitively demanding aspects of writing proficiency, such as audience awareness, argumentation, critical thinking, and creativity… A related weakness of automated scoring is that these systems could potentially be manipulated by test takers seeking an unfair advantage. Examinees may, for example, use complicated words, use formulaic but logically incoherent language, or artificially increase the length of the essay to try and improve their scores.
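
To make that kind of manipulation concrete, here is a deliberately crude sketch in Python of a scorer that looks only at the surface features described above: length and uncommon vocabulary. The feature set and weights are invented for this illustration; they are not Pearson's, AIR's, or ETS's actual models. Under any scorer like this, a repeated nonsense passage padded with long, rare words outscores a short, coherent answer:

```python
import re

# A tiny stand-in for "everyday" vocabulary; anything outside it counts as "rare".
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "was",
    "for", "on", "with", "as", "are", "be", "this", "by", "because",
}

def surface_score(essay: str) -> float:
    """Score an essay using only surface features: length, word length, rare words."""
    words = re.findall(r"[a-z']+", essay.lower())
    if not words:
        return 0.0
    length = len(words)                                  # verbosity looks "developed"
    avg_word_len = sum(len(w) for w in words) / length   # long words look "advanced"
    rare_ratio = sum(w not in COMMON_WORDS for w in words) / length
    # Invented weights; any scorer that rewards these features is gameable the same way.
    return 0.02 * length + 0.5 * avg_word_len + 2.0 * rare_ratio

coherent = "The test was unfair because the questions were confusing and too short."
padded_nonsense = ("Radiation on advocates to an orator transmits gamma rays "
                   "of parsimony to implode. ") * 10

print(surface_score(coherent))         # lower score for the clear, on-topic sentence
print(surface_score(padded_nonsense))  # higher score for repeated gibberish
```

Real engines use more features than this sketch, but the critics' argument is that as long as those features are proxies for length and vocabulary rather than meaning, the same gaming strategy works.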

The inability of machine scoring to distinguish between nonsense and coherence may lead to a debasement of instruction, with teachers and test prep companies engaged in training students on how to game the system by writing verbose and pretentious prose that will receive high scores from the machines. In sum, machine scoring will encourage students to become poor writers and communicators.

Link to the rest at The Washington Post

PG suggests that Pearson has a powerful incentive to use computer systems to generate test scores. Human graders are expensive, slow, difficult to manage and, most importantly, a drag on profits.

PG also suggests that school systems that will be graded and classified based on their students’ test scores will be powerfully incented to spend class time instructing students about how to write for the standardized tests.

Using the buzzword “artificial intelligence” is certainly a lovely way to sell machine grading. However, PG also suggests that Pearson is far from a magnet for smart people working in artificial intelligence.

Essentially, Pearson is a big publishing company with more than a few similarities to the large New York trade publishing companies. The companies at the forefront of artificial intelligence include Google, Amazon, Microsoft, Facebook, Intel, Baidu and Tencent.

Tech people have become very wealthy working for Google, Amazon and the like. Nobody knows a tech person who has become wealthy working for Pearson (or Simon & Schuster, Hachette, HarperCollins, etc.). In short, these large publishers don't have the brainpower for cutting-edge AI research and probably don't have the brainpower to distinguish quality AI from junk.

Human graders have their own drawbacks (young attorneys barely out of law school themselves could once be observed on San Francisco cable cars, grading written essays from the California bar exam), but there is no technological fog obscuring their reliability the way there is with “artificial intelligence.”

7 thoughts on “Should you trust a computer to grade your child’s writing on Common Core tests?”

  1. AI research has been struggling for decades to get up to the level of artificial stupidity. Only in recent years have a few projects demonstrated the level of an artificial idiot savant.

  2. “Should you trust a computer to grade your child’s writing on Common Core tests?”

    Title be a question so ‘no’ …

    The problem with ‘one test to rule them all’ is that then you have schools teaching the test – after all, that’s the way the schools are graded – by the percentage of their students that pass the test …

    I saw this far too often in computer tech jobs: new kids coming in that ‘knew the answers’, but had no idea what the answers meant in regard to doing the job. Heck, to be honest many didn’t actually understand the dang question – which meant changing the question got you a now wrong (but the same as last time) answer.

    But doing it other ways costs more money, money that everyone says would be well spent but no one wants to pay for.

  3. I graded standardized tests for Pearson for six weeks and then got a better job working at a bowling alley.

    From personal experience, the current state of AI can’t do any worse than the humans now grading exams at Pearson.

  4. At one point Pearson was owned by Random Penguin. I’m not sure if it still is, but…there were linkages between RP, Pearson, and Common Core curriculum/reading recommendations at one time.

  5. As a teacher and fan of computers, I believe there are areas where computer programs can provide effective individualized learning for young children. These areas include spelling and arithmetic where the ‘answer’ is either right or wrong. In these areas, AI can also test effectively.

    Once students progress to subjects that require value judgements, however, it’s time for qualified humans to take over. It’s not rocket science. Sadly the profit principle seems to trump both logic and thousands of years of experience.
