Law firms are throwing legal spaghetti at the wall to take down gen-AI, but judges are so far unimpressed

From The New Publishing Standard:

Lawsuits against AI companies abound, and there’s no question that some valid issues need settling in court, but it’s already beginning to feel like lawyers are just throwing spaghetti at the wall and hoping some strands stick.
Is this really what the publishing industry wants or needs?

Over at Publishers Weekly this week, Andrew Albanese summarises two ongoing lawsuits against the alleged AI copyright thieves, and in both cases a judge has thrown out parts of the claims because they have no merit.

While the judge in one case has left the door open for revised claims – perhaps a nod to the fact that the law as it stands was never written with AI in mind – the quick dismissal of some of the claims is a severe blow to the many in the AI Resistance camp who cite allegations of copyright theft as fact, despite, as Albanese notes, many lawyers stating well in advance that the claims were not well grounded in law.

From PW back in July:

Multiple copyright lawyers told PW on background that the claims likely face an uphill battle in court. Even if the suits get past the threshold issues associated with the alleged copying at issue and how AI training actually works—which is no sure thing—lawyers say there is ample case law to suggest fair use.

PW offers several examples of why, which in the interests of fair use I’ll leave for you to click through and read; instead, I’ll conclude the summary of that PW article with this quote:

‘I just don’t see how these cases have legs,’ one copyright lawyer bluntly told PW. ‘Look, I get it. Somebody has to make a test case. Otherwise there’s nothing but blogging and opinion pieces and stance-taking by proponents on either side. But I just think there’s too much established case law to support this kind of transformative use as a fair use.’

The July lawsuit came under scrutiny from TNPS at the time.

. . . .

The proposed class action suit before Chhabria was filed on July 7 by the Joseph Saveri Law Firm on behalf of authors Christopher Golden, Richard Kadrey, and comedian Sarah Silverman, just days after the Saveri firm filed a similar suit on behalf of authors against OpenAI, with authors Paul Tremblay and Mona Awad as named plaintiffs.

. . . .

In each case the lawsuits make the spurious claim that AI is generating writing in the style of an author or providing in-depth analysis of a published book, and that it does so by illegally copying an original work for its “training.”

For anyone who isn’t irrationally opposed to the very concept of AI and therefore clutching at any straw to attack it, the idea that it is a crime for an author to write in the style of another is as laughable as the idea that an author who learned their trade by reading other authors’ books has committed a crime.

What next? A lawsuit claiming an author has no spelling mistakes so they must have plagiarised a dictionary?

Link to the rest at The New Publishing Standard

17 thoughts on “Law firms are throwing legal spaghetti at the wall to take down gen-AI, but judges are so far unimpressed”

  1. Worth looking at this in the context of the Universal lyrics lawsuit. There, the dataset is much smaller and more specific, so “overfit”, i.e. exact copying, is a lot more apparent. However, articles like this fundamentally misunderstand that all genAI models, visual or textual, don’t function any differently from Anthropic’s model. They still “fit” to the prompt in exactly the same way; it’s only that their datasets are so large as to obfuscate the original sources. Effectively, they infringe from many more sources, simply to cover up the infringement. It’s really unfortunate that marketing points such as “luddite” and “learning” are being repeated uncritically even in legal circles, while experts in the technology, whose research is in fact used in it, like Francois Chollet, are being ignored. GenAI models are, in his words, “interpolative search engines”. They are next-token probability engines, completely determinative, nothing more. And those seeking fair legal redress shouldn’t be demeaned by an arcanist’s manipulation to the contrary.
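A minimal sketch of the “next-token probability engine” idea the comment invokes: a toy bigram model that deterministically returns the most frequent next token. The tiny corpus is invented purely for illustration; real models learn weights over vast vocabularies rather than counting raw bigrams.

```python
from collections import Counter, defaultdict

corpus = "the spider came along the wall and the spider left".split()

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_token(token):
    """Deterministically return the most frequent next token."""
    counts = follows[token]
    return counts.most_common(1)[0][0] if counts else None

print(next_token("the"))  # "spider": it follows "the" twice, "wall" once
```

The point of the sketch: given the same counts and the same input, the output never varies, which is what “completely determinative” means here.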

  2. Wait. You started with people copying others’ works to make software, and somehow you ended up with people writing in someone else’s style.
    Are AI developers writing in other people’s styles?
    Or are they copying their works without reading them to build their software? (Hint: This is literally what happens.)

  3. Neither.
    Leaving aside all the legal precedents and constitutional limits of copyright that make the infringement claims null, there is a fundamental misunderstanding in the idea that the LLMs *copy* content.
    They don’t.

    The long, multistep process that produces a LARGE LANGUAGE MODEL doesn’t “suck up” online content, so the model doesn’t actually contain copies of *anything* it “sees”. Rather, the internet crawlers “look” at all non-shielded data (not behind a paywall or robots.txt), analyze it on the fly, and look for *relationships* between symbols (text, glyphs, pixels, etc.) that they compile into a training dataset. The training dataset is then fed to a neural network (which may be software or custom hardware) which, depending on its design and instructions, organizes the relationship data; the query software then mines that dataset to assemble a reply based on the relationships (plural) tied to the symbols in the submitted query. And because there is nothing even vaguely resembling intelligence or comprehension in the software, the output has no direct link to the original material.

    An example: you ask the chatbot “Who wrote the best selling book titled ALONG CAME A SPIDER?”
    The software might return the string “James Patterson”.
    But to do that it doesn’t need to know:
    – what writing is
    – what a book is
    – what is bestselling
    – who or what a James Patterson is

    All it needs to know is that the string “ALONG CAME A SPIDER” relates to the strings “book”, “movie”, “bestseller”, “James Patterson”, “Morgan Freeman”, and thousands of other strings, each with thousands of relationship strings linked to it. The system then weighs all the relationships against the input string and discards the links that don’t strongly relate to it: “movie” and “Morgan Freeman” only weakly relate to “bestseller” and “wrote”, but “James Patterson” relates strongly to both as well as to the title. If the software properly weighs “bestseller” and “book” it will reply “James Patterson”, but if it doesn’t, or if the query only asks “who wrote ALONG CAME A SPIDER?”, it might reply “Marc Moss” instead. He wrote the screenplay for the movie, which grossed $100M+ in 2001, is more recent than the book, and in some internal context is a “valid” answer. Such mismatched contexts are what are called “hallucinations”.
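The weighing described above can be sketched in a few lines. The candidate names and the relation strengths below are invented toy numbers, not real model weights; the point is only that the answer flips when context terms are dropped from the query.

```python
# Toy relation weights: how strongly each candidate string relates to
# each context string (all values are invented for illustration).
relations = {
    "James Patterson": {"book": 0.9, "bestseller": 0.8, "movie": 0.3},
    "Marc Moss":       {"book": 0.1, "bestseller": 0.2, "movie": 0.9},
    "Morgan Freeman":  {"book": 0.0, "bestseller": 0.1, "movie": 0.8},
}

def best_match(query_terms):
    """Pick the candidate with the highest total relation weight."""
    def score(name):
        return sum(relations[name].get(t, 0.0) for t in query_terms)
    return max(relations, key=score)

# With "book" and "bestseller" in the query, Patterson wins out...
print(best_match(["book", "bestseller"]))  # James Patterson
# ...but with only "movie" as context, the screenwriter surfaces instead.
print(best_match(["movie"]))               # Marc Moss
```

No text of any book is consulted; only the relation table is, which is why a mis-weighted query can produce a confident wrong answer.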

    That chatbots can and do misweigh relationships would not be possible (in this case) if they actually had access to the text of the book, and especially the cover.

    In addition, folks, do you have any idea how much data the internet holds? Hint: in 2020, the amount of data on the internet hit 64 zettabytes. A zettabyte is about a trillion gigabytes.

    GPT-4 was trained on a dataset of only 100 trillion elements, which is about a billion times smaller. And it still uses absurd amounts of computing resources to answer a single query.
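A quick back-of-the-envelope check of that comparison, assuming (as the figures imply) that each training “element” is on the order of one byte:

```python
# 64 zettabytes of internet data (the 2020 estimate cited above) vs. a
# 100-trillion-element training set, with ~1 byte per element assumed.
internet_bytes = 64e21
training_elements = 100e12

ratio = internet_bytes / training_elements
print(f"{ratio:.1e}")  # 6.4e+08, i.e. roughly a billion times smaller
```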

    It is physically impossible for a chatbot to access a full copy of everything it deals with. The tech does not exist and may never exist; the amount of data grows faster than the hardware capacity to use it.

    So not only do LLMs not contain copies of anything, they can’t. Not if they’re going to do anything useful with it.

    Now, once that myth is disposed of, start researching the 30 years of internet law and a few hundred years of copyright precedents…

    …or just go back ten years to the Authors Guild lawsuit against Google over scanning and indexing books. See how that one turned out.

    The OP is 100.0000% correct.
    Whoever is advising the plaintiffs is taking them for a ride.
    (Or totally incompetent.)

    Hence, this:

    • I suppose we might ask
      1) show me the copy of The Great American Novel in the code, and
      2) show me a copy of The Great American Novel in the system’s files.
      3) So, where’s the beef?

      A memory dump of a neural net is absolute mush to a human no matter how the data is formatted. But a memory dump of a novel can be read by a human.

    • This is flagrantly incorrect, on just about every level, about how these systems operate.

      But, to address the point on which all the rest is based: during the initial processing phase, all information for creating the training datasets is downloaded, hence copied. Groups such as LAION and the non-profit arm of OpenAI used a special non-profit research exemption under copyright law to allow for their processing work (a largely manual process in which sources are ranked for quality, and after which they are meant to delete the data, which according to reports it appears they did not do). However, both were funded by, and their results used by, for-profit entities: Stability AI for the former, and the for-profit side of OpenAI for the latter. Not even ML companies claim otherwise (feel free to search for LAION and laundering, or OpenAI and Kenyan workers, as keywords for sources on that), so any assertion otherwise is just plain “hallucinated” misinformation.

      In regards to containing the information, refer to experts such as Francois Chollet, whose research literally helped build the models, especially in regards to how statistical aggregation plus an instruction set is effectively just very high compression (of the pared-down set produced during processing). “Overfit”, as they call it, is simply when textual or visual data sources were either limited or end up weighted highly enough in the “latent space” that they don’t have enough additional sources to interpolate with, and therefore produce something easily recognizable as a single original source. That’s why the Universal Music lawsuit found it so easy to find direct examples (a niche dataset) and why various independent researchers (such as the University of Maryland team that published shortly after Stable Diffusion’s release) have found constant examples of near-exact matches with dataset sources. Machine learning of this nature is a determinative system; the legal arguments presented here and in the article rely on obfuscation and mysticism to convince others the models operate in a completely different way than they actually do. They are interpolative search engines; that’s it.

      • Explain how spiders fit your theory. Or browsers. Or digital cameras of all stripes. They have been legal worldwide for decades. Your definition of “copy” does not hold in court, in IT, or in the real world.

        It’s not obfuscation if the dataset contains not a full literal (or even encrypted) copy but relational data about the elements in the sample. In-transit images are not legally copies because they are snippets and are not retained. Nor distributed. Distribution is the heart of copyright under both fair use and fair dealing. Permission to see or “see” is not required.

        And relying on the Universal case willingly neglects that that database served up the actual lyrics.

        You are betting everything on a definition of “copy” that is about as valid as the Internet Archive’s “controlled digital lending”.

        • All your examples have exemptions for their purpose, none of which my comment negates or even mentions. As already pointed out, they used one such fair use/fair dealing exemption as a form of copyright laundering when it was actually for for-profit purposes.

          Even the statements of the parties involved in the processing stage (again, search those keywords if you don’t believe it) disprove your comment. That’s not redefining copying; it’s just that your understanding of the technology is so incorrect you believe it has to be different. It’s tilting at windmills.

          It’s utterly bizarre that AI proponents are literally arguing against the assertions made by the experts whose research is part of the technology, as if Chollet is somehow wrong about his own field and his own papers.

            • If only there weren’t astroturfing lawyers paid by Google et al misleading the public (see Chamber of Progress’s bona fides, for example) and buying articles left and right. If only there were judges who had a technical background or transferable skills overseeing these cases, or who allowed them to get far enough that we could hear testimony from those who helped invent it all and who are even cited by the researchers in these companies. If only there weren’t so many easily duped commentators on the internet who want us to treat technology like mystical voodoo because they have trouble parsing abstract concepts.

              But, sadly, reality is often disappointing.

            • There might be less noise if folks bothered to understand tech basics: like streaming, which they may or may not be using.
              And that is the exact same thing as an AI training spider:

              1- A producer uploads a video file to YouTube’s servers.
              2- A viewer activates the link to the file and the server sends a bit-by-bit stream of data that the browser analyzes, assembles into a video frame, and presents while receiving more data. It replaces the first frame with the next, over and over, until the server reaches the end of the file. The entire file is viewed, but at no time did the client hold a *copy* of the full file, just a stream of data packets.
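The chunk-by-chunk point in step 2 can be sketched as follows; the in-memory “server file”, the chunk size, and the frame contents are all stand-ins for a real remote stream:

```python
import io

server_file = io.BytesIO(b"frame1frame2frame3")  # stand-in for a remote file
CHUNK = 6

def stream(f, chunk_size):
    """Yield the file one chunk at a time, like a streaming client."""
    while chunk := f.read(chunk_size):
        yield chunk  # only this chunk is in hand; earlier ones are gone

current = None
for frame in stream(server_file, CHUNK):
    current = frame  # "display" replaces the previous frame

print(current)  # b'frame3': only the final chunk remains, never the whole file
```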

              We’re long past 2005, when watching an online video meant *downloading* the whole thing, often overnight, to watch later.
              (And happy to do it, too.)

              Internet spiders/crawlers do what the YouTube client (app or browser) does: link to an *open* server, analyze the data stream the *server* offers up, and feed it to their “viewer”. For search engines, the data the crawler is looking for is specific keywords; for AI training it is looking for the relationships and *usage* of words.

              In neither case is the server data actually copied into the output database.

              The data itself is *worthless* to the database; only its online location (for search engines) or word-use relationships (for LLM trainers) matter.
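As a sketch of recording relationships rather than content: here is a toy indexer that keeps only word co-occurrence counts and discards the page text itself (the sample text is invented for illustration).

```python
from collections import Counter
from itertools import combinations

page_text = "along came a spider james patterson bestseller book"  # invented sample

def index_relationships(text):
    """Record which words co-occur; the source text itself is not retained."""
    words = sorted(set(text.split()))
    return Counter(combinations(words, 2))  # only pair counts survive

index = index_relationships(page_text)
print(index[("bestseller", "book")])  # 1: the pair co-occurred once
```

The returned counts can answer “what relates to what”, but the original sentence cannot be reconstructed from them.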

              The acronym itself explains it: Large Language Model. It is a Large LANGUAGE MODEL. It is all about the *language*, not the words or phrases or spelling.

              Not only are the misguided suits headed into the brick wall of technical reality, they are angsting over last year’s process. Training crawlers are passé.

              We’re back in the chaos of the ’90s: new models are evolving monthly, and the newer models don’t need online crawlers. They use internal proprietary data (where the *real* money lies) for private, targeted applications or, even more fun, use multiple older models and new techniques to create smaller, more efficient models that don’t use as many expensive resources and are at least as good, but “smarter” and cheaper. (Prepare for the next wave of AI angst over Q*. A story for later, I’m sure.)

              There’s this *open source* model Microsoft showed off this week, Orca 2:

              “The models come in two sizes, 7 billion and 13 billion parameters, and build on the work done on the original 13B Orca model that demonstrated strong reasoning abilities by imitating step-by-step reasoning traces of bigger, more capable models a few months ago.

              “With Orca 2, we continue to show that improved training signals and methods can empower smaller language models to achieve enhanced reasoning abilities, which are typically found only in much larger language models,” Microsoft researchers wrote in a joint blog post.

              “The company has open-sourced both new models for further research on the development and evaluation of smaller models that can perform just as well as bigger ones. This work can give enterprises, particularly those with limited resources, a better option to address their targeted use cases without investing too much in computing capacity.”

              Note the version. The first version was released less than six months ago. And since this *is* Microsoft, who traditionally take three tries to get a product “right”, they’ll have the third version ready for commercial use by next spring. 😉

              And since it is open source, every well-funded researcher can look at it and learn from it, even use it to train *their* models. We are now deep in the era of software writing software.

              It looks to be about as good as GPT-4 at the things that matter to it, so we can look forward to even smaller, more focused models coming *inside* PC apps.

              AI chaos is here.

              TL;DR: Expect way more AI models, everywhere.

              (Cue up a Butlerian Jihad.) 😉

              • Interesting idea. If you don’t have a file, is it possible for you to copy it?

                We might also expect huge leaps in the computing power bringing AI to more people and organizations.

              • You may want to look up how web browser caching works.

                But, for the final time: both diffusion models and all LLMs have already made clear they required copying the information and then engaged in manual parsing of it for the initial processing. Only after that is “training” done, which is where tokenisation and relationships are established (probability), and which produces the latent-space database the model will draw from (aggregation).

                Open a search tab, type in OpenAI and Kenyan workers, and look for news articles. You can also look up LLMs and the “books3” dataset. I have no idea why you insist on rewriting reality/history, but it’s as bizarre as arguing you know better how the technology works than leading figures in the field who helped create the technology.

                To address your example: any dynamic or tuned models are simply additions or different weightings/training approaches on top of originating cores, which are still based on the original datasets. None of these model types work without a massive amount of original data; this is, again, literally something the companies admit to. It’s even in their justifications in their submissions to the copyright office. That’s all your link is, too: Orca 2 is just a tuned version of Llama 2 (check Hugging Face; it’s clearly listed there), which also used books3. The parameter difference is just a trimming of the aggregate set for a specific purpose with different weighting, and Microsoft specifically states it’s still bound by Llama 2’s limitations. It’s not a new model, or a rewriting, or anything remotely like that.

                This is why not treating technology with mysticism is so important, as it completely distorts reality for hype/PR.
