From The Wall Street Journal:
Since the arrival of chatbots that can carry on conversations, make up sonnets and ace the LSAT, many people have been in awe of the artificial-intelligence technology’s capabilities.
Publishers of online content share in that sense of wonder. They also see a threat to their businesses, and are headed to a showdown with the makers of the technology.
In recent weeks, publishing executives have begun examining the extent to which their content has been used to “train” AI tools such as ChatGPT, how they should be compensated and what their legal options are, according to people familiar with meetings organized by the News Media Alliance, a publishing trade group.
“We have valuable content that’s being used constantly to generate revenue for others off the backs of investments that we make, that requires real human work, and that has to be compensated,” said Danielle Coffey, executive vice president and general counsel of the News Media Alliance.
ChatGPT, released last November by parent company OpenAI, operates as a stand-alone tool but is also being integrated into Microsoft Corp.’s Bing search engine and other tools. Alphabet Inc.’s Google this week opened to the public its own conversational program, Bard, which also can generate humanlike responses.
Reddit has had talks with Microsoft about the use of its content in AI training, people familiar with the discussions said. A Reddit spokesman declined to comment.
Robert Thomson, chief executive of The Wall Street Journal parent News Corp, said at a recent investor conference that he has “started discussions with a certain party who shall remain nameless.”
“Clearly, they are using proprietary content—there should be, obviously, some compensation for that,” Mr. Thomson said.
At the heart of the debate is the question of whether AI companies have the legal right to scrape content off the internet and feed it into their training models. A legal provision called “fair use” allows for copyright material to be used without permission in certain circumstances.
In an interview, OpenAI CEO Sam Altman said “we’ve done a lot with fair use” when it comes to ChatGPT, which was trained on two-year-old data. He also said OpenAI has struck deals for content when warranted.
“We’re willing to pay a lot for very high-quality data in certain domains,” such as science, Mr. Altman said.
One concern for publishers is that AI tools could drain traffic and advertising dollars away from their sites. Microsoft’s version of the technology includes links in the answers to users’ questions—showing the articles it drew upon to provide a recipe for chicken soup or suggest an itinerary for a trip to Greece, for example.
“On Bing Chat, I don’t think people recognize this, but everything is clickable,” Microsoft CEO Satya Nadella said in an interview, referring to the inherent value exchange in such links. Publishing executives say it is an open question how many users will actually click on those links and travel to their sites.
Microsoft has been making direct payments to publishers for many years in the form of content-licensing deals for its MSN platform. Some publishing executives say those deals don’t cover AI products. Microsoft declined to comment.
Link to the rest at The Wall Street Journal
This issue will inevitably show up in a variety of copyright infringement court cases. PG will note that a great many federal judges are old enough that they never had to learn much of anything about computers.
With that wild card disclaimer, PG doesn’t think that having a computer examine an image or a text of any length, then create a human-incomprehensible bunch of numbers based upon its examination to fuel an artificial intelligence program which almost certainly will not be able to construct an exact copy of the input, adds up to copyright infringement.
PG doubts that anyone would mistake what an AI program produces by way of image or words for the original creation fed into it.
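PG’s “human-incomprehensible bunch of numbers” can be illustrated with a minimal sketch. This is a hypothetical toy, not any actual AI training pipeline: it hashes each word of a text into one slot of a small fixed-size vector, producing numbers from which the original wording cannot be recovered.

```python
# Toy sketch (hypothetical, not an actual model-training method):
# reduce a text to a small fixed-size vector of counts via feature
# hashing. The mapping is lossy -- the original words cannot be
# reconstructed from the resulting numbers.
import hashlib

def text_to_vector(text: str, dims: int = 8) -> list[int]:
    """Hash each word into one of `dims` buckets and count how many land in each."""
    vector = [0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode()).digest()
        bucket = digest[0] % dims  # which slot this word falls into
        vector[bucket] += 1
    return vector

original = "a computer examines a text and produces numbers"
print(text_to_vector(original))  # a list of 8 counts -- no way back to the words
```

Many different texts map to the same vector, which is one informal way of seeing why such a representation is not a substitute for the work it was derived from.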
3 thoughts on “Publishers Prepare for Showdown With Microsoft, Google Over AI Tools”
From where I’m reclining, I think they are using the wrong argument.
As the Google HathiTrust case showed, computer processing of content that does not result in a full substitute for the content (attn: Internet Archive) does not infringe copyright. Neither the models nor their answers are one-for-one substitutes for any of the content processed.
Also, in this case, the model-building software can be described as extracting ideas and concepts out of the processed material. Which aren’t copyrightable, right? At most one might argue they are building a form of database from the ideas and concepts the model “learns”. And again, the content in a database isn’t copyrightable; at most the organization is. So building the models is home free.
Now, in use, the models scan the internet (as search crawlers have done for three decades) and provide relevant extracts and hyperlinks. (Again, Google/HathiTrust.) Or they process user input and provide an *original* output. Or redirect the user input.
(The only question I can see is whether the user inputs are significant enough to confer copyright on the output. To which I can only point to: Photography.)
Do these folks even understand what world they live in?
As a mere engineer, I await clarification from the legal eagles. 🙂
We are also very likely to see one instance of AI reading what another instance generates. This can be within the same brand name, or between them. It becomes a recursive process. The original source can easily be lost in cascading references and cross-references.
This is a problem based upon the assumption that the seventeenth-century term “copyright” is a German compound noun composed of two words with existing distinct meanings, and itself meaning the exact sum of those two words. That is, that copyright ≡ copy + right.
I’m afraid not. And it hasn’t been so in the US since the 1790 Act, which also granted authors the right to control non-copy uses of their works.
The real question is not “is it a copy?”, but “is it (a) a use ordinarily reserved to the author (b) for which there is no systematic defense?” Because it literally is a copy when a digital representation is transferred from storage N to processor M; that’s the nature of how von Neumann processors work. (Maybe it will be different as we move to other computational models and systems — it’s too early to predict whether actually deployable quantum computers will work only by “touching” external data without making a copy.)
There’s a whole line of cases based on photography — some from long before the digital age — that make it as clear as it’s going to get that the literal copying aspect is secondary to the author’s/holder’s limited right to control reuses (copies being one variety of reuse). The ethical question is “what are those limits” — a much harder inquiry, and one that changes over time (example: until the Townsend Amendments (1912), a film based on a novel was not contemplated as an infringement of copyright). But at least it’s the correct inquiry.