New York Times considers legal action against OpenAI as copyright tensions swirl

From National Public Radio:

The New York Times and OpenAI could end up in court.

Lawyers for the newspaper are exploring whether to sue OpenAI to protect the intellectual property rights associated with its reporting, according to two people with direct knowledge of the discussions.

For weeks, the Times and the maker of ChatGPT have been locked in tense negotiations over a licensing deal under which OpenAI would pay the Times for incorporating its stories into the tech company’s AI tools, but the discussions have become so contentious that the paper is now considering legal action.

The individuals who confirmed the potential lawsuit requested anonymity because they were not authorized to speak publicly about the matter.

A lawsuit from the Times against OpenAI would set up what could be the most high-profile legal tussle yet over copyright protection in the age of generative AI.

A top concern for the Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper’s staff.

It’s a fear heightened by tech companies using generative AI tools in search engines. Microsoft, which has invested billions into OpenAI, is now powering its Bing search engine with ChatGPT.

If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from the Times, the need to visit the publisher’s website is greatly diminished, said one person involved in the talks.

So-called large language models like ChatGPT have scraped vast parts of the internet to assemble data that inform how the chatbot responds to various inquiries. The data-mining is conducted without permission. Whether hoovering up this massive repository is legal remains an open question.

If OpenAI is found to have violated any copyrights in this process, federal law allows for the infringing articles to be destroyed at the end of the case.

Link to the rest at National Public Radio

As PG has mentioned on a couple of previous occasions, he has doubts about copyright infringement claims like the ones the Times is asserting because, to the best of PG’s knowledge, no AI stores the original copyrighted works or is capable of reproducing them.

Instead, the contents of the Times, plus a huge number of other texts, are used to train the AI model and are then deleted after training is complete. The AI can then use what it learned from the ingested texts to form an understanding of their meanings and draw on that understanding to create new expressions of knowledge in response to the wide range of queries and commands that individual users submit.

PG doesn’t think the AI can ever recreate the words of the original Times stories. The AI uses the information it has ingested to create new responses to tasks individual users want it to perform.
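A minimal, hypothetical Python sketch of the distinction PG is drawing: a “model” that keeps only statistics derived from its training texts and can generate new word sequences without retaining the texts themselves. (Real LLMs learn billions of numeric weights rather than bigram counts, and this toy is not how ChatGPT actually works — but the principle that training produces weights, not copies, is analogous.)

```python
from collections import Counter, defaultdict
import random

def train(corpus_texts):
    """Build word-pair (bigram) counts, then discard the source texts."""
    counts = defaultdict(Counter)
    for text in corpus_texts:
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # The "model" is only these counts; the articles themselves
    # are not stored anywhere in it.
    return counts

def generate(model, start, length=8, seed=0):
    """Emit a new word sequence from the statistics alone."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

# Hypothetical two-sentence "corpus" standing in for news articles.
articles = [
    "the court weighed the copyright claim",
    "the claim rested on the original reporting",
]
model = train(articles)
print(generate(model, "the"))
```

The generated text recombines word transitions seen in training; whether such a recombination ever reproduces a protected passage verbatim is exactly the factual question a court would probe.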

The analogy PG thinks is apt is what happens when he reads a story in the Times or elsewhere, then uses that knowledge to answer questions posed by others or to create other writings that don’t replicate the original Times articles but may include ideas, facts, etc., that he has picked up during his extensive reading of a large collection of articles from a great many sources.

8 thoughts on “New York Times considers legal action against OpenAI as copyright tensions swirl”

  1. Another hard-to-parse element that I can’t resolve is that one would have to prove that AI program X actually scanned Site Y or Book Z as part of its process. For the NYT, that might be easier to prove than for an author or the Authors Guild to prove their books were included. That they MIGHT have scanned an author’s book, or an article, doesn’t seem to me to be enough to get anyone over the discovery hurdle.

    That the man over there might have diddled my wife doesn’t mean I can take a DNA sample from their underwear to see if there is evidence of it.

    • Bigger problem: does it matter (legally)?
      The entire publishing industry apparently doesn’t “grok” what 100 trillion pieces of information means, and it has simultaneously forgotten about the Google Books case.

      They might as well try to charge for photographing the NYC skyline.

  2. The problem here is that everyone is confusing “information” with “expression.” NYT articles that do not appear on the opinions page are explicitly written (under the paper’s guidelines) to minimize the uniqueness of expression. That’s not to say that there’s never any unique expression — just that the purple prose, the quotable quotes, are for the opinion pieces and letters.

    Copyright protects original expression, and infringement means copying that expression. On very, very, very rare occasions, that might look like it relates to facts being discussed or the analysis of those facts — but that is so rare that even in the obvious case of “Why, exactly, did Gerald Ford think he should pardon Richard Nixon?” the expression in Ford’s own (ghostwriter’s) words was the copyright infringement. There weren’t even complaints raised about embargoed-prepublication factual analysis that did not quote the dire 800 words. (See Harper & Row, Pubs., Inc. v. Nation Enters., 471 U.S. 539 (1985).)

    In this instance, if ChatGPT (in the generic sense) assimilates the facts and analyses from the NYT, that’s most emphatically not a copyright infringement… unless ChatGPT did so by making an unauthorized copy, exceeding fair use. (Which it did; that’s how von Neumann-architecture computing systems — which is to say every commercial computer system now in use, including the one you’re using to read this and the one I’m using to compose this — do things.) This is utterly distinct from PG reading the article, taking notes of the relevant facts, and using that to train the extra brain he’s been growing in the storage unit.† When PG did/continues to do that, he’s not making an unauthorized copy — the “fixed copy” is PG’s own notes.

    Bluntly, the NYT is utterly the wrong party to pretend to be a plaintiff here. Not to mention that being “trained” on the style of the NYT is pretty much guaranteed to result in grievous errors due to the old-school inverted-pyramid structure of the articles, which ceased being a good idea by some time in the 1960s for technological reasons. And ceased being defensible the day Dr. Berners-Lee released the initial specification for the World Wide Web.

    But that’s not at all to say that there couldn’t be an appropriate plaintiff, particularly for ChatGPT learning “how to write good” by studying, say, all of the original fiction published in The New Yawkah from 1992 to 1998. Whether that might actually prove fruitful, or even be possible, is for a literary conference about ten years from now, commenting on a paper with a title like “On the Self-Referential Navel-Gazing Tendencies of Large-Language-Model-Generated Fiction” being presented by a grad student wearing a Mao jacket over a Che Guevara t-shirt. “Expression” is primary in fiction and poetry, and still requires substantial further analysis (or one ends up thinking Robert Frost supported isolationism… and that we really should kill all the lawyers first thing).

    † C’mon, man. We know why all of those Boston Scientific “old bookcases” had to be put in storage — the power company was starting to ask questions, and it’s easier to hide the power draw and smell from the nutrient solutions in the storage unit than at the new, downsized Casa PG.
