OpenAI disputes authors’ claims that every ChatGPT response is a derivative work


From Ars Technica:

This week, OpenAI finally responded to a pair of nearly identical class-action lawsuits from book authors—including Sarah Silverman, Paul Tremblay, Mona Awad, Chris Golden, and Richard Kadrey—who earlier this summer alleged that ChatGPT was illegally trained on pirated copies of their books.

In OpenAI’s motion to dismiss (filed in both lawsuits), the company asked a US district court in California to toss all but one claim alleging direct copyright infringement, which OpenAI hopes to defeat at “a later stage of the case.”

The authors’ other claims—alleging vicarious copyright infringement, violation of the Digital Millennium Copyright Act (DMCA), unfair competition, negligence, and unjust enrichment—need to be “trimmed” from the lawsuits “so that these cases do not proceed to discovery and beyond with legally infirm theories of liability,” OpenAI argued.

OpenAI claimed that the authors “misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”

According to OpenAI, even if the authors’ books were a “tiny part” of ChatGPT’s massive data set, “the use of copyrighted materials by innovators in transformative ways does not violate copyright.” Unlike plagiarists who seek to directly profit off distributing copyrighted materials, OpenAI argued that its goal was “to teach its models to derive the rules underlying human language” to do things like help people “save time at work,” “make daily life easier,” or simply entertain themselves by typing prompts into ChatGPT.

The purpose of copyright law, OpenAI argued, is “to promote the Progress of Science and useful Arts” by protecting the way authors express ideas, but “not the underlying idea itself, facts embodied within the author’s articulated message, or other building blocks of creative,” which are arguably the elements of authors’ works that would be useful to ChatGPT’s training model. Citing a notable copyright case involving Google Books, OpenAI reminded the court that “while an author may register a copyright in her book, the ‘statistical information’ pertaining to ‘word frequencies, syntactic patterns, and thematic markers’ in that book are beyond the scope of copyright protection.”

“Under the resulting judicial precedent, it is not an infringement to create ‘wholesale cop[ies] of [a work] as a preliminary step’ to develop a new, non-infringing product, even if the new product competes with the original,” OpenAI wrote.

In particular, OpenAI hopes to convince the court that the authors’ vicarious copyright infringement claim—which alleges that every ChatGPT output represents a derivative work, “regardless of whether there are any similarities between the output and the training works”—is an “erroneous legal conclusion.”

The company’s motion to dismiss cited “a simple response to a question (e.g., ‘Yes’),” or responding with “the name of the President of the United States” or with “a paragraph describing the plot, themes, and significance of Homer’s The Iliad” as examples of why every single ChatGPT output cannot seriously be considered a derivative work under authors’ “legally infirm” theory.

“That is not how copyright law works,” OpenAI argued, while claiming that any ChatGPT outputs that do connect to authors’ works are similar to “book reports or reviews.”

Link to the rest at Ars Technica

As PG has mentioned previously, he believes that using a relatively small amount of copyright-protected material, together with far larger amounts of material not subject to copyright protection, to train an AI, rather than to make copies of the copyrighted material, qualifies as fair use.

Even absent fair use, such use is not a violation of copyright protection because the AI is not making copies of copyrighted materials.

PG has mentioned other analogies, but one that popped into his mind on this occasion is an author who reads hundreds of romance novels for the purpose of learning how to write a romance novel and then writes a romance novel using tropes and techniques that many other romance authors have used before.

From Wikipedia:

Precursors of the modern popular love-romance can also be found in the sentimental novel Pamela, or Virtue Rewarded, by Samuel Richardson, published in 1740. Pamela was the first popular novel to be based on a courtship as told from the perspective of the heroine. Unlike many of the novels of the time, Pamela had a happy ending.

. . . .

Women will pick up a romance novel knowing what to expect, and this foreknowledge of the reader is very important. When the hero and heroine meet and fall in love, maybe they don’t know they’re in love but the reader does. Then a conflict will draw them apart, but you know in the end they’ll be back together, and preferably married or planning to be by page 192.

Joan Schulhafer of Pocket Books, 1982

A great many of the most financially successful authors PG knows are romance authors.

16 thoughts on “OpenAI disputes authors’ claims that every ChatGPT response is a derivative work”

  1. AI today is a Rorschach test that brings out people’s innermost anxieties.
    The times are tough and getting tougher, but the blame lies with the educational systems that have failed to properly educate the masses for the technological age, which is already over two generations old. That tech illiteracy gives urban myth and scammers free rein (like the idiots peddling “alien mummies” to the Mexican legislature this week).

    People will believe all sorts of irrationalities (UFOs, lizard people, “the government is here to help!”):

    Particularly in the US, public education has become past-focused instead of future-leaning. Nothing good will come of it.

    Fortunately, the alternatives are taking over.
    The future is coming whether they’re ready or not.

  2. Unfortunately for the argument that the AI is not making a copy of the copyrighted material, the courts have taken the stance that all computer processing of copyrighted material requires copying it (from disk to memory, from memory to disk, from one computer to another, from one place in memory to another), meaning that any use of digital works makes hundreds or thousands of copies.

    • But will the courts actually regard these thousands of copies as constituting a breach of copyright or will fair use (or whatever) provide cover for the AI’s actions? Whenever I do a search in Google Books it feels like their – much more permanent – copies don’t matter.

    • I keep hearing this.
      Which court? When?

      Seems to ignore both transformative use (any “copies” a computer manipulates are anything but human-readable and incapable of substituting for “the precious”) and de minimis copying (given the transient evanescence of said representation).
      Technical illiteracy at work.

      By that “logic” FAX machines would be illegal.

      Hopefully it won’t take ten years to overcome the FUD and arrive at fair use this time.

        • Ahh, but you see, digital copies are different: they are perfect duplicates that don’t degrade, and are therefore worse than any other copy

          or so the ‘logic’ goes.

          If it’s not clear: I consider this line of argument stupid and counterproductive. But I’ve seen it too much over the years.

    • I’d like to see cites for these copyright cases, D.

      Plus, my understanding is that the process of digesting documents for an AI involves copies that exist as copies for a few milliseconds (or less) before they’re chopped and diced into a zillion pieces for processing.

      No human being could ever view the original copyrighted documents after they get sucked into the computer.

      Additionally, no copy of copyrighted material remains on the very fast computer once it’s been processed.

      Plus, as I’ve mentioned before, I think there’s a separate and strong fair use argument to be made.

      • Look at all the piracy decisions that have turned software from something you purchase into something that you rent.

        All the way back to Autodesk, whose lawsuit claimed it could forbid you from selling a copy of its software to someone else because of the copying involved: it isn’t like selling a physical copy, it’s an ongoing license that allows copying to happen under specific conditions.

        The ability to sell music files, or even to rip your own CDs and use the digital versions, is limited because you can’t actually sell something that can’t be used without further copying.

        • AI training is not CD ripping.
          Those are full copies.
          The output can be and is *distributed* as a substitute for the full product. (Yet nobody in the US has challenged the CD ripping itself, only the *distribution* of the content. Personal use remains undetermined but presumed fair use. That *is* relevant to AI training.)

          More relevant, though, is the near-eternal Google book-scanning case, which was finally deemed fair use because Google wasn’t distributing the database it created, only quotes.

          “AI” training doesn’t even do that.

          Now, *this* is debatable and is being debated:

          Very different story, because it is the training database that is being distributed; yet the odds are that if they went after somebody with competent lawyers, the activists would get shut down, too. The training database is not the same as a torrent full of ebooks that substitute for the original.

          They fail to understand that “in the style of” is not a substitute for actual published texts. That is not how copyright works in the land of fair use. Copyright only protects specific products from being replaced with full copies of themselves. Other uses of the product are deemed fair use as long as they meet the court-established tests.

          In the UK they had a dogfight over a proposed law adding a copyright exception that explicitly *allowed* AI training. Necessary because they don’t have fair use, and the modern UK is hostile to technological change, as demonstrated by the CMA’s contortions trying to block the MS ABK buyout over, of all things, the trivial cloud-gaming market. (Back in the US, their FTC activist partners were laughed out of court in a week.) In the UK it was musicians screaming bloody murder about AI training. No fear we’ll see AI software coming out of there any time soon.

          As I said, protesters are either tech-illiterate or Luddites, and in the US at least they still don’t call the shots, much as the activists try.

          • > AI training is not CD ripping.
            > Those are full copies.

            but to train the AI, you have to make full copies of the material in memory (and probably on disk as well) for the training software to read.

            I know it’s stupid to call this sort of thing copying for purposes of copyright protection, but that’s what’s been done to block the sale of digital books, music, movies, and software by people.

            • But it’s not remotely the same.
              And the legal principles aren’t either.

              And no, they do not work the same way.
              Ripping and scanning books are format-shifting and *holding* human-usable content for distribution.
              AI “learning” does none of that.

              That is just not how it works.
              It never holds a full book at once in its memory.
              That would be wasteful.

              It works word by word, sentence by sentence, and analyzes every similar use of the same words. Once it analyzes the data, it files what it finds (“learns”) in a database, erases the sample, and moves on, processing 100 trillion bits in a few months at a rate of a few million bits a minute(?). It never holds the full book, and the snippets only stay there for a fraction of a second, never in human-readable form. Again, holding more would be a waste of expensive computer resources.

              Understand: it isn’t about the book, its story, or its ideas; it is about the *language* and how humans use it. When you ask it for something “in the style of Dorothy Sayers,” it looks up the words “Dorothy Sayers” and all the words that uniquely or primarily correlate to those words. It doesn’t know beans about the nature of “Dorothy Sayers” or, for that matter, “human” or “author”; just the other words that in some form relate to that text string, and it assembles a string of words that correlate to the prompt. It doesn’t always get it right because in the learning process each word/sentence is linked to more than one other string, and each link has a weight value attached.

              Prompt it with “grok” and it will instantly relate it to “Heinlein,” or “Motie” to “Pournelle,” but solely as words, not as ideas or concepts. And if the weights add up wrong, you get a “hallucination,” an improper string.

              That is why they are called LARGE LANGUAGE MODELS.

              Again, there is no intelligence at work, no understanding, no thought. Just very fast link sorting and text string manipulation. Just absurdly fast processing of a ginormous database. And, to bring in yet another legal precedent, the data in a database is not copyrightable, only the organization. Which is why it is doubtful the Books3 training set is actually in violation of anything, even under european law.

              It’s really just an enormous dictionary, which is why its first significant use is abstracting the words in documents.

              • the process of creating that dictionary requires duplicating the works. I’m not saying that the dictionary is the works.

        • On the UK trying to allow their companies to develop their own AI tech:


          Instead, they ended up creating an entirely new bureaucracy to regulate AI. (Yeah, right.) And this under the allegedly pro-business party. Lord knows what will come when they are voted out.
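The “statistics, not copies” idea debated in the comments above can be sketched as a toy co-occurrence counter. This is purely illustrative: real LLM training learns neural-network weights rather than a lookup table, and the function and sample text below are invented for the example. The point it demonstrates is that what the program retains is counts about the words, never the text itself.

```python
from collections import Counter, defaultdict

def learn_cooccurrence(lines, window=2):
    """Stream text line by line, tallying which words appear near
    which other words, then discard each line. Only the counts
    survive -- a toy stand-in for retaining statistics, not copies."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for i, w in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][words[j]] += 1
        # nothing past this point ever reads the original line again
    return counts

sample = ["the hero meets the heroine", "the heroine loves the hero"]
stats = learn_cooccurrence(sample)
# "hero" ends up linked to nearby words such as "meets" and "loves",
# but the sentences themselves are not stored anywhere.
```

Prompting-style lookups in this toy world amount to asking `stats["hero"]` for its most strongly correlated words; whether such retained statistics infringe is exactly what the cases above will decide.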

Comments are closed.