Why a Little-Known Copyright Case May Shape the Future of AI

PG thought he had blogged about this case before but couldn’t find any evidence of doing so when he searched TPV.

From Copyright Lately:

While a flurry of AI copyright lawsuits from prominent authors and artists grab headlines, another case has quietly taken something more important: a head start.

Even die-hard copyright geeks would be forgiven for overlooking a lawsuit first filed over three years ago by information services company Thomson Reuters against AI start-up Ross Intelligence. That’s because the case involves Westlaw, a legal research tool that’s about as sexy as the underwear section in a 1940s Sears catalog. I say this with peace and love as a longtime Westlaw user, but let’s be honest—headnotes and key numbers are simply no match for the likes of Sarah Silverman and John Grisham.

It’s time to start paying attention though, because a Delaware District Court judge just ordered this low-profile AI case to trial, largely denying the parties’ motions for summary judgment on copyright infringement and fair use (read the opinion here). This means that a jury could weigh in on some of the thorniest copyright questions involving artificial intelligence as early as May 2024.

Thomson Reuters v. Ross Intelligence

The issues at play in Thomson Reuters v. Ross Intelligence largely mirror those I’ve discussed in connection with recent class action copyright lawsuits filed against the creators of Stable Diffusion, ChatGPT and other generative AI tools. In a nutshell, plaintiffs allege that Ross hired a third-party contractor to unlawfully copy Westlaw content—including its proprietary Key Number System and case headnotes—in order to train Ross’s own AI-driven natural language legal search engine.

Unlike the creative works ingested by AI tools in the recent lawsuits filed against OpenAI and Stability AI, the copyrights in Westlaw are more limited. Thomson Reuters doesn’t own any of the underlying judicial opinions that make up its database. It does, however, claim copyright in its Key Number organization system as well as its original case summaries and headnote descriptions. These “editorial enhancements” are drafted by the company’s attorney-editors in what I’d imagine is the most thankless job this side of working for Louis Litt.

But according to Ross, it wasn’t interested in the Westlaw key numbers or headnotes. Instead, the goal of its system was for users to ask questions and for the search engine to spit out quotations directly from judicial opinions—no commentary necessary. In other words, Ross contends that the output of its tool won’t infringe any original copyrighted material owned by Thomson Reuters, notwithstanding the so-called “intermediate copies” of West’s key numbers and headnotes that may have been made to initially train Ross’s dataset. These copies, Ross claims, are fair use.

In January, Thomson Reuters moved for summary judgment on its copyright infringement claim, and both sides moved for summary judgment on Ross’s fair use defense.

Judge Stephanos Bibas ultimately declined to determine the scope of protection to be given the Key Number System or to decide whether Westlaw’s headnotes added sufficient non-trivial material to the underlying judicial opinions to meet copyright’s originality threshold. While the court did find that Ross committed an act of “actual copying” by scraping and reproducing headnotes during the AI training process, whether that copying constitutes infringement will depend on whether or not the headnotes are protected expression. That issue will be decided by a jury.

The court likewise ruled that a jury needs to decide whether there are substantial similarities in protectable expression (as opposed to unprotectable material) between Westlaw’s headnotes and summaries and thousands of “bulk memos” created by Ross’s third-party contractor to train Ross’s AI tool.

Fair Use

The court found disputed issues of fact on all four fair use factors, meaning that a jury will be tasked with answering most of the questions underlying this key defense.

The Purpose and Character of the Use

Interestingly, the court’s first factor analysis largely focused, not on the commercial nature of Ross’s competing tool, but on disputes over whether Ross’s copying was transformative—an inquiry that some observers (but, ahem, not this one) thought would take a backseat following the Supreme Court’s recent Warhol decision.

Judge Bibas noted that whether Ross’s so-called “intermediate copying” (copies made during the input stage of the training process) was transformative would depend on the precise nature of Ross’s actions: “It was transformative intermediate copying if Ross’s AI only studied the language patterns in the headnotes to learn how to produce judicial opinion quotes.” If, on the other hand, “Thomson Reuters is right that Ross used the untransformed text of headnotes to get its AI to replicate and reproduce the creative drafting done by Westlaw’s attorney-editors,” then the copying would weigh against a transformative fair use. This raised a material question of fact that a jury needs to decide.

The Nature of the Copyrighted Work

While declining to definitively rule that Westlaw’s headnotes were too unoriginal to satisfy the second fair use factor, the judge certainly signaled that he didn’t think plaintiffs’ contributions were at the “core of intended copyright protection,” and specifically distinguished them from “traditionally protected materials, such as literary works or visual art.”

The Amount and Substantiality of the Copying

Because it was unclear how much of Ross’s copying was of protectable expression, the court found that a jury would need to decide the third fair use factor too. Interestingly, the court also noted that copying could be deemed insubstantial if Ross’s AI actually works in the way the company claimed—i.e., if the tool outputs only the unprotectable judicial opinion, not any original expression. This suggests that the presence or absence of substantial similarity at the output stage may influence the court’s input stage rulings as well.

The Effect of the Use Upon the Market for the Work

Finally, on the fourth fair use factor, the court declined to decide whether Ross’s use of Westlaw’s material had a “meaningful or significant effect” on the value of the original or its potential market. Focusing not merely on economic effects, but “public benefits” of the copying, the court concluded that a jury would be best situated to answer these questions:

Link to the rest at Copyright Lately

The OP brought to mind a case decided a very long time ago (BI – Before Internet) that caused PG to write an article for a legal publication. PG’s article was titled “Who Owns the Law?”

One problem with BI writings is that PG has not been able to locate an online copy of “Who Owns the Law?”

Basically, the copyright issue he wrote about BI was more than a little similar to the dispute described in the OP.

In PG’s ancient article, he wrote about West Publishing, now owned by the same Thomson Reuters mentioned in the OP.

Way back when, West was a closely-held and secretive company that claimed broad copyright protection for the volume and page numbers universally used by lawyers and judges to identify state and federal court opinions West published in printed form.

Here’s an example of a case citation:

Stearns v. Ticketmaster Corp., 655 F. 3d 1013 (9th Cir. 2011)

West assigned the 655 F. 3d 1013 portion of the citation. Translated, it means the volume (655), the reporter (F. 3d, which is an abbreviation for Federal Reporter, Third Series) and the page number in volume 655 (1013) where the printed case may be found.

(The Federal Reporter series of books is reserved for decisions from the various United States Courts of Appeals, the second-highest courts in the United States. 9th Cir means the decision was handed down by the 9th Circuit Court of Appeals. There are twelve regional circuits that cover the United States. The 9th Circuit is geographically the largest of the circuits by a large margin. It includes the states of California, Arizona, Nevada, Oregon, Washington, Idaho, Montana, Alaska and Hawaii. The 9th Circuit also includes Guam and the Northern Mariana Islands.) (You’ve taken your first steps toward mastering legal research.)
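For readers who think in code, the anatomy of the citation above can be sketched in a few lines of Python. This is only an illustration: the names `Citation` and `parse_citation` and the regular expression are assumptions made up for this example, not part of any official citation standard, and the pattern only handles citations shaped like the Stearns example.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    """Hypothetical container for the parts of a reporter citation."""
    case_name: str
    volume: int      # the numbered book on the shelf
    reporter: str    # the series of books, e.g. "F. 3d"
    page: int        # first page of the opinion in that volume
    court: str       # deciding court, e.g. "9th Cir."
    year: int


# Assumed shape: "<name>, <volume> <reporter> <page> (<court> <year>)"
_PATTERN = re.compile(
    r"^(?P<name>.+?),\s+"
    r"(?P<volume>\d+)\s+"
    r"(?P<reporter>.+?)\s+"
    r"(?P<page>\d+)\s+"
    r"\((?P<court>.+?)\s+(?P<year>\d{4})\)$"
)


def parse_citation(text: str) -> Optional[Citation]:
    """Split a citation into its parts, or return None if it does not fit."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return Citation(
        case_name=match.group("name"),
        volume=int(match.group("volume")),
        reporter=match.group("reporter"),
        page=int(match.group("page")),
        court=match.group("court"),
        year=int(match.group("year")),
    )


stearns = parse_citation(
    "Stearns v. Ticketmaster Corp., 655 F. 3d 1013 (9th Cir. 2011)"
)
```

Here `stearns.volume`, `stearns.reporter` and `stearns.page` hold the three pieces West assigned; the case name, court and year come from the court itself.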

West’s copyright rationale was that the company fixed the sort of typos and citation errors that were embarrassingly common during those times before spell check. West further added page numbers to the thick books containing lots of court opinions that the company printed.

West also added a short summary describing what the court case was all about. West also had (and may still have) an enormous outline of the law, which it called the West Key Number System. Its attorneys would go through each case and identify portions that correlated with its Key Number System for other court cases.

From an attorney’s point of view, if you found a case opinion similar to the one you were working on that included a West Key Number citation, you could look up that Key Number Citation and, hopefully, find a number of in-state and federal case opinions addressing issues you were working on at the moment. In some instantiations, the West Key Number index would also show you case opinions in other jurisdictions, which might suggest a line of legal argument for the hometown case you were handling.

The Key Number system was rendered obsolete almost immediately when online search systems were published that allowed an attorney to perform Boolean searches against all decisions rendered by courts in the jurisdiction. As extensive as West’s Key Number system was, it was a blunt instrument when computerized legal research came on the scene.
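To make the contrast concrete, here is a minimal Python sketch of what a Boolean search over opinion text looks like. It is a toy, not a description of how Lexis or Westlaw actually worked: the three "opinions" are invented placeholder snippets, and `build_index`, `search_and` and `search_or` are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Invented placeholder snippets standing in for full court opinions.
docs = {
    "case_a": "Copyright in page numbers",
    "case_b": "Breach of contract and damages",
    "case_c": "Copyright damages for infringement",
}


def build_index(documents):
    """Map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip('.,;:()"')].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every term (Boolean AND)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents containing at least one term (Boolean OR)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


index = build_index(docs)
both = search_and(index, "copyright", "damages")       # only case_c has both words
either = search_or(index, "contract", "infringement")  # case_b or case_c
```

The point of the sketch is that the searcher picks the words, and any word in any opinion is a possible entry point. With a Key Number outline, an editor had to decide in advance which topics each case belonged under.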

Additionally, West printed the cases in thick books with page numbers. Lawyers used the West page numbers in their court papers to point the judge to the particular portion of the case opinion they wanted the judge to examine.

This made it more likely that the judge would tell his judicial clerk or secretary to get a copy of a case or, at least, copies of the pages the attorney wanted the judge to read that were buried in a 50-page appellate case opinion.

Then along came a technology company called Mead Data Central, later renamed Lexis-Nexis. The name refers to the Lexis online research system for lawyers and the Nexis repository of news, magazines, academic journals, scientific publications, etc., which had the same computer search capabilities as Lexis.

Lexis basically tore apart every West book full of court opinions, state and federal statutes and other similar collections of federal, state and local government publications that lawyers would find helpful.

After removing the materials West had added to the original government documents, Lexis sent the judicial opinions, statutes, etc., offshore, where a zillion less-expensive fingers and thumbs keyboarded them into the Lexis-Nexis computer systems. The computer systems made the electronic copies of the documents searchable.

West sued Mead Data Central, the owner of Lexis, for copyright infringement.

Mead said these were public documents and West couldn’t assert copyright protection for government documents prepared by government employees.

West said that its case citations and page numbers were copyrighted because West had developed a system of organization and included page breaks and page numbers that weren’t in the original court documents. The fact that inserting page numbers required none of the creative activity that Congress intended to encourage with copyright laws didn’t bother West. It worked hard to do a good job, and Lexis shouldn’t be able to steal West’s hard work.

The hometown trial judge bought West’s dubious theory. The judge agreed that the data West used (the words included in court opinions and government documents) were in the public domain and unprotected by copyright law. However, the judge held that “the (West) arrangement and pagination of this public material reflects the skill, discretion and effort of the person crafting the arrangement.”

In other words, West didn’t own the words, but, by working hard to insert page numbers and put the cases from Alaska into a different printed book than the cases from California (“by the sweat of West’s brow”), West was entitled to copyright in the volume and page numbers in its case publications.

Lawyers had long used case numbers and page citations in documents submitted to the court to point the judge to the particular court case they wanted the judge to consider, out of a library full of books containing thousands of court cases. (Judges don’t respond well to requests from lawyers to “Look it up yourself.”) As a result, West had built an effective monopoly on the citation system that judges and lawyers had established so each group could do its job.

Mead appealed, and West, realizing that, sooner or later, some appellate court would reverse earlier court decisions, entered into a super-secret settlement agreement that effectively allowed Mead the right to use West case numbers and page numbers.

As mentioned, PG worked for Lexis a long time ago but never persuaded corporate counsel to let him see a copy of the West settlement documents. PG quickly realized that one reason for the secrecy was that the settlement provided West and Lexis with a shared monopoly on case citations.

9 thoughts on “Why a Little-Known Copyright Case May Shape the Future of AI”

  1. What PG neglects to mention (probably, and understandably, because it would reinforce impressions that he’s no longer a spring chicken) is that these cases were from the 1980s. By the 1990s, they were dead, both explicitly (the Hyperlaw matters, mostly in the Second Circuit) and under the Supreme Court’s interpretation that copyright constitutionally requires actual originality (Feist). The Register of Copyrights was never happy with West’s position, either, as it placed “works of the United States government” in copyright despite the Copyright Act’s exclusion of works of the United States government or its employees.

    And that became an issue a decade later regarding attempts to extend expired copyrights via trademark claims, in Dastar, which turned on (a) Gen Eisenhower’s autobiography, which was probably a government work (as it was ghostwritten by active-duty personnel, a mistake that Gen Schwarzkopf and Gen Powell carefully avoided), and (b) lots and lots and lots of government-procured film and photographs, and copies of government documents used as illustrations.

    Then, I’m no longer a spring chicken, either. If I’m a bird at all, I’m a 25-year-old King Vulture.

    • Hmm,verry interestink:

      “The king vulture (Sarcoramphus papa) is a large bird found in Central and South America. It is a member of the New World vulture family Cathartidae. This vulture lives predominantly in tropical lowland forests stretching from southern Mexico to northern Argentina. It is the only surviving member of the genus Sarcoramphus, although fossil members are known.

      “Large and predominantly white, the king vulture has gray to black ruff, flight, and tail feathers. The head and neck are bald, with the skin color varying, including yellow, orange, blue, purple, and red. The king vulture has a very noticeable orange fleshy caruncle on its beak. This vulture is a scavenger and it often makes the initial cut into a fresh carcass. It also displaces smaller New World vulture species from a carcass. King vultures have been known to live for up to 30 years in captivity.

      “King vultures were popular figures in the Mayan codices as well as in local folklore and medicine. Although currently listed as Least Concern by the IUCN, they are decreasing in number, due primarily to habitat loss.”

      https://en.m.wikipedia.org/wiki/King_vulture

    • Your comments are all correct, C.

      However, I didn’t want to drive off a great many intelligent visitors with too much detail about legal history of copyright cases.

  2. I want to put in a good word for pre-computer indexing systems. They often were very creative, and could work quite well. The classic library card catalog is a good example. But such systems required much labor to keep up to date and took up a lot of space. Once computers became cheap, the transition to digital was inevitable. Not, however, necessarily with better results for the user. A typical search turns up a lot of chaff to be sifted through.

    • R. – As someone who dealt with library cards in a very large library several centuries ago, the library card could provide a useful way of researching.

      If you found a book that wasn’t quite on point, you could easily look in physically adjacent Dewey Decimal shelving classes and sometimes find a book a short distance away that addressed your needs.

      • Wait a bit.
        That is exactly the kind of stuff a fully realized LLM database front end will make short work of. And it won’t take long. Figure 2025.

        • As long as we’re dealing with LoC cataloging (and not Dewey Decimal Monstrosity cataloging, which places the entire law library between 340.15 and 347.8… but gives “religion and philosophy” from 200.00 to 299.99), I’ll trust a human librarian to determine what belongs nearby on the shelf. Maybe by 2035 LLM database front ends will be able to weight similarities and dissimilarities, but two years is rather overoptimistic.

          It will happen, but just like practical fusion we’re not two years from it.

          • You sure?
            https://www.washingtonpost.com/business/2023/05/10/fusion-power-microsoft/

            Helion is just one of several privately funded efforts making solid (and fast!) progress.
            The Aussies are working on an interesting approach, too, using particle accelerator tech. The beauty of Helion’s tech is it creates its fuel and doesn’t use big, expensive heat engines. No irradiated hardware either.

            Fusion will always be 50 years away as long as the only approach is the “ornithopter” tokamak big government approach. But that’s not the only road.

            As for LLMs, while everybody focuses on the internet chatbot foibles, they’re ignoring the inhouse, focused, small database uses like the astronomers use. Today.

            Both MS and Amazon are already selling that kind of “AI” as part of AWS and Azure services. Private chatbots, if you will, running against corporate databases instead of the internet. No hallucinations there.

            The future is closer than it seems. 😀

            • Polymath Greg Cochran points out that it is a bad idea to count on technology that hasn’t been invented yet. A lot can go wrong between now and the arrival of the Inevitable Future, as the Soviets learned.
