NY Times sues OpenAI, Microsoft for infringing copyrighted works

From Reuters:

The New York Times sued OpenAI and Microsoft on Wednesday, accusing them of using millions of the newspaper’s articles without permission to help train chatbots to provide information to readers.

The Times said it is the first major U.S. media organization to sue OpenAI, creator of the popular artificial-intelligence platform ChatGPT, and Microsoft, an OpenAI investor and creator of the AI platform now known as Copilot, over copyright issues associated with its works.

Writers and others have also sued to limit the scraping — or the automatic collection of data — by AI services of their online content without compensation.

The newspaper’s complaint, filed in Manhattan federal court, accused OpenAI and Microsoft of trying to “free-ride on The Times’s massive investment in its journalism” by using it to provide alternative means to deliver information to readers.

. . . .

“There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it,” the Times said.

OpenAI and Microsoft did not immediately respond to requests for comment. They have said that using copyrighted works to train AI products amounts to “fair use.”

Fair use is a legal doctrine governing the unlicensed use of copyrighted material.

On its website, the U.S. Copyright Office says “transformative” uses add “something new, with a further purpose or character” and are “more likely to be considered fair.”

The Times is not seeking a specific amount of damages, but the 172-year-old newspaper estimated damages in the “billions of dollars.”

It also wants the companies to destroy chatbot models and training sets that incorporate its material. Talks this year to avert a lawsuit and allow “a mutually beneficial value exchange” with the defendants were unsuccessful, the newspaper said.

. . . .

The Times filed its lawsuit seven years after the U.S. Supreme Court refused to revive a challenge to Google’s digital library of millions of books.

A federal appeals court had found that the library, which gave readers access to snippets of text, amounted to fair use of authors’ works.

“OpenAI is giving the copyright industry a second bite at control,” said Deven Desai, a professor of business law and ethics at the Georgia Institute of Technology.

“It’s outputs that matter,” Desai said. “Part of the problem in assessing OpenAI’s liability is that the company has altered its products as copyright issues arose. A court could say its outputs at this moment in time are enough to find liability.”

Chatbots have compounded the struggle among major media organizations to attract and retain readers, though the Times has fared better than most.

. . . .

The Times’ lawsuit cited several instances in which OpenAI and Microsoft chatbots gave users near-verbatim excerpts of its articles.

These included a Pulitzer Prize-winning 2019 series on predatory lending in New York City’s taxi industry, and Pete Wells’ 2012 review of Guy Fieri’s since-closed Guy’s American Kitchen & Bar that became a viral sensation.

The Times said such infringements threaten high-quality journalism by reducing readers’ perceived need to visit its website, reducing traffic and potentially cutting into advertising and subscription revenue.

It also said the defendants’ chatbots make it harder for readers to distinguish fact from fiction, including when their technology falsely attributes information to the newspaper.

The Times said ChatGPT once falsely attributed two recommendations for office chairs to its Wirecutter product review website.

“In AI parlance, this is called a ‘hallucination,’” the Times said. “In plain English, it’s misinformation.”

Link to the rest at Reuters

As PG has opined for some time, he believes that the way AIs use materials protected by copyright from the Times and others is fair use.

A traditional definition of fair use is any copying of copyrighted material done for a limited and “transformative” purpose, such as to comment upon, criticize, or parody a copyrighted work.

PG finds it difficult to regard the way AI programs use copyright-protected material as anything but extraordinarily transformative. He doubts there is any way someone can prompt an AI program to reproduce an article that first appeared in The New York Times.

If PG still had free access to the huge NEXIS database of newspapers, periodicals, books, etc., etc., he speculates he could perform a search using a paragraph from an NYT article and find more than one identical or quite similar earlier use in another publication.

7 thoughts on “NY Times sues OpenAI, Microsoft for infringing copyrighted works”

    • So, they are automating the robots.txt protocol?
      Not really new.
      “A **robots.txt** file is a text file that webmasters create to instruct web robots, typically search engine robots, how to crawl pages on their website ¹. It is part of the **robots exclusion protocol (REP)**, a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users ¹. The file contains instructions for bots indicating which web pages they can and cannot access ³.

      Webmasters use the **robots.txt** file to control the behavior of web crawlers on their website ¹. The file can be used to block web crawlers from accessing certain pages or directories on the website ². This is useful for webmasters who want to prevent search engines from indexing certain pages on their website ².

      The **robots.txt** file is written in a specific format. Each set of user-agent directives appears as a discrete set, separated by a line break ¹. The basic format of the file is as follows:

      User-agent: [user-agent name]
      Disallow: [URL string not to be crawled]

      Together, these two lines are considered a complete **robots.txt** file ¹. However, one **robots.txt** file can contain multiple lines of user agents and directives ¹.

      Please note that the **robots.txt** file is not a mechanism for keeping a web page out of Google ². To keep a web page out of Google, webmasters should block indexing with noindex or password-protect the page ².”

      Source: Conversation with Bing, 12/29/2023


      Crawlers are legal and robots.txt is the opt-out tool.
      That is one of the things the NYT is going to have to answer for. They waited until August 2023 to block crawlers, which in cyberspace is implied consent.
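
A minimal sketch of how a crawler might check the User-agent/Disallow format quoted above, using Python's standard-library `urllib.robotparser`. The file contents, the `is_allowed` helper, and the example.com URLs are illustrative assumptions, not the Times's actual robots.txt; GPTBot is used here only as an example crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler everywhere,
# and block every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "OtherBot"):
        for path in ("/article", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Note that the file only states a preference: nothing in the protocol stops a crawler that chooses not to run a check like this.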

  1. There’s room to disagree on whether what large-language-model systems do with input text is fair use. PG believes it’s fair use; I don’t; neither of us is, therefore, evil for that reason. (I’m evil for plenty of other reasons.)

    There’s no room whatsoever to disagree about the irony of the New York Times, in particular, being the plaintiff in a digital database lawsuit after Tasini and Muchnick. Apparently, the newspaper of record (according to itself) is so serious that it can’t recognize irony. And that’s before considering the NYC echobox problem… not to mention the “society pages,” let alone the idiocy of its arts coverage over the entire half century-plus that I’ve been forced to confront it (which itself leads to some interesting copyright questions, but that’s for another time).

    Perhaps there does need to be a lawsuit to put these things in play. But not in the Second Circuit… and definitely not with this plaintiff.

    • Other plaintiffs don’t look to be quite as desperate, though.
      Blaming a year-old chatbot for a decades-long decline is a Hail Mary cash grab.

      (Unless they are hoping MS uses sofa cushion money to buy them out?)
