Authors Join the Brewing Legal Battle Over AI

From Publishers Weekly:

Authors have now joined the growing ranks of concerned creators suing tech developers over their much-hyped generative AI technology. And a pair of copyright class action suits recently filed on behalf of authors is raising broader questions about the most effective way to protect creators and creative industries—including authors and publishers—from the potentially disruptive aspects of AI.

Filed on June 28 and July 7 by the Joseph Saveri Law Firm on behalf of five named plaintiffs (Mona Awad and Paul Tremblay in one case, and Christopher Golden, Richard Kadrey, and comedian Sarah Silverman in the other), the suits claim that Microsoft-backed OpenAI (creators of ChatGPT) and Meta (creators of LLaMA) infringed the authors’ copyrights by using unauthorized copies of their books to train their AI models, including copies allegedly scraped from notorious pirate sites. While the authors’ attorneys did not comment for this story, a spokesperson for the firm suggested to Ars Technica that, if left unchecked, AI models built with “stolen works” could eventually replace the authors they stole from, and framed the litigation as part of “a larger fight for preserving ownership rights for all artists and creators.”

The authors join a spectrum of increasingly concerned creators on whose behalf the Saveri law firm has filed similar copyright-based lawsuits in recent months. In November 2022, the firm filed suit against GitHub on behalf of a group of software developers. And in January, the firm sued three AI image generators on behalf of a group of artists. Those cases are still pending—and, like most copyright cases involving new technology, they have divided copyright experts. Those who lean in favor of the tech side claim that using unlicensed copyrighted works to train AI is fair use. Those on the content creator side argue that questions of ownership and provenance cannot simply be waved away without major, far-reaching implications.

Neither Meta nor OpenAI has yet responded to the author suits. But multiple copyright lawyers told PW on background that the claims likely face an uphill battle in court. Even if the suits get past the threshold issues associated with the alleged copying at issue and how AI training actually works—which is no sure thing—lawyers say there is ample case law to suggest fair use. For example, a case against a plagiarism-detection service held that student papers could be ingested to create a database used to expose plagiarism by students. The landmark Kelly v. Arriba Soft case held that the reproduction and display of photos as thumbnails was fair use. And, in the publishing industry’s own backyard, there’s the landmark Google Books case. One lawyer noted that if Google’s bulk copying and display of tens of millions of books was comfortably found to be fair use, it’s hard to see how using books to train AI would not be, while also cautioning that fair use cases are notoriously fact-dependent and hard to predict.

“I just don’t see how these cases have legs,” one copyright lawyer bluntly told PW. “Look, I get it. Somebody has to make a test case. Otherwise there’s nothing but blogging and opinion pieces and stance-taking by proponents on either side. But I just think there’s too much established case law to support this kind of transformative use as a fair use.”

Cornell Law School professor James Grimmelmann—who has written extensively on the Google case and is now following AI developments closely—is also skeptical that the authors’ infringement cases can succeed, and concurred that AI developers have some “powerful precedents” to rely on. But he is also “a little more sympathetic in principle” to the idea that some AI models may be infringing. “The difference between AI and Google Books is that some AI models could emit infringing works, whereas snippet view in Google Books was designed to prevent output infringement,” he said. “That inflects the fair use analysis, although there are still a lot of factors pointing to transformative use.”

Whether the AI in question was trained using illegal copies from pirate sites could also be a complicating factor, Grimmelmann said. “There’s an orthodox copyright analysis that says if the output is not infringing, a transformative internal process is fair use,” he explained. Nevertheless, some courts will consider the source, he added, noting that the allegedly “unsavory origins” of the copies could factor into a court’s fair use analysis.

Link to the rest at Publishers Weekly

5 thoughts on “Authors Join the Brewing Legal Battle Over AI”

  1. Here, I disagree with Professor Grimmelmann; we’ve genially disagreed on the point for a decade and a half, and it’s a hard question.

    My position — implicit in, but badly submerged and largely unengaged within, the Google Books lawsuits — is that infringement is complete upon making a copy, and that “transformation” relates not to the making of the copy, but to the nature of the later output. Professor Grimmelmann’s position is that the later transformation overcomes any intermediate-copy problem. I don’t see a compelling policy or legal determination for either; and a story from Mozart’s childhood (dramatized in part in a few seconds in Amadeus) illustrates the difficulties.

    During the mid-eighteenth century — and remember, this was in continental Europe, where Statute-of-Anne-like copyright didn’t exist yet and whether music was copyrightable in the first place was an open question — liturgical music was closely guarded. No musician had access to the full score, and the “conductor” was almost always the “composer.” Mozart went to two masses, and then created another musical work (if I recall correctly, for harpsichord or fortepiano, which he played rather than scored for chamber strings, and I distinctly recall that there’s controversy over whether Mozart ever himself scored the new work) based on that protected liturgical work. Which got him in trouble with the hierarchy. It does, however, expose the problem we’ve got here: There’s no record of Mozart actually making a copy, in the modern copyright sense, of the music he heard… and even if there had been, ownership of the copyright would have been dubious (remember, it’s the paparazza photographer, not the celebrity doing embarrassing things in the photograph, who owns the copyright — she who fixes it owns the copyright in the fixation). Conversely, as to generative-AI training systems, current computational models demand that systems create a working internal copy; but this time, it’s not a working internal copy of an unfixed work, but of a previously fixed work (usually text, sometimes images) for which there is the clear existence of a copyright precisely because it is fixed.

    Another, oversimplistic, way to put it is that “Andy Warhol didn’t need to make any unauthorized intermediate copies of the Prince photo in order to make his fifteen prints; each ‘copy’ that was fixed had already been transformed. Conversely, a computational system doesn’t work like an artist’s mind; the computational system relies upon stepwise operations on multiple intermediate copies. That is, the process is different in the human mind than in the computational system, so conflating the two without detailed analysis is A Bad Idea.”

    I don’t pretend to have an answer here: This is hard. I merely suggest that conflating the entire process into a “transformative use” rather blithely ignores the process of the use itself. The transformation is a process, not a product, and because the process of creating generative AI systems at present relies, for one step in that process, upon an invalid operator,† the entire inquiry is less distinct, less simplifiable, than it seems.

    † This should sound very much like perpetual-motion machines that claim to violate the second law of thermodynamics: when one breaks the machine down into its individual subparts and subprocesses, spotting the error — usually an invalid assumption of frictionless operation, sometimes a blatant division by zero or an operation depending upon a determinable value of infinity or an asymptotic function (like cot(0)) — is both possible and replicable. Because computational research is changing so rapidly, it’s a lot harder in this context.

    • As you note, this is hard. One can argue that a human, when they read/hear/view a work, has also created an internal “copy” of the work. This “copy” is obviously not illegal.

      However, the “copy” in this case is (except for a few very rare individuals) imperfect, not an exact one. I don’t know how many times I have re-read a book – even one of my favorites – after someone pointed out an error in what I thought was in it.

      But – counterpoint – this is a permanent “copy,” at least until something happens to destroy the human memory. The internal “copy” for an AI (assuming it has one, which is not absolutely necessary – it can be written to analyze as it scrapes from the source, never bringing the entire thing into its own memory) can be discarded in its entirety once the analysis is done and integrated into its database.

      Now, if any of the sources were illegal copies in the first place… That is a different matter. Humans can be hit with copyright infringement claims for the use of pirated content, no matter how they access it. Here, I think the law would be clear, assuming such access by the AI is proven.

      • The second claim is probably more likely to proceed, but it does not create a precedent: OK, they copied books, so you owe someone x times some large number. But they are merely guilty of copying, unless the plaintiffs try for some claim based on fruit of the poisonous tree.

        Proving the AI has a copy may be difficult. Generative AIs don’t have perfect copies either; one might be able to give an opening paragraph or a book-report summary, but so could a person upon reading.

  2. Twenty-five New York City elevator operators filed suit today charging that the Otis Elevator Company had studied their paths and techniques, and had created automated devices that mimic their movements and threaten their livelihoods. Speaking from its Upson Downs Headquarters, an Otis spokesman said, “Elevators aren’t special.”

  3. Legal claims that might (after a few years of wrangling) stand a chance of making it to trial against generative software require a proper fine-grained understanding of the processes and principles. Even the much simpler Google case took years.

    Generative AI?

    Here’s the *simplified* back of envelope explanation:

    Given precedents and, more importantly, how fair use is case-by-case dissected and *output* focused, I side with the OP majority pundits on this. “Copying is copying” is not likely to be much of a case when dealing with one element among a hundred trillion (GPT-4), to say nothing of whatever process is dominant by 2025 after a couple years of “internet time” software development and deployed model use and refinement.

    We are approaching the million-typing-monkeys threshold, so any element-focused catfight isn’t going to go very far. Fair use, de minimis, and market impact stand in the way.

    I foresee lots of lawsuits, but they will be over process and patents, not over inputs.
