Why I let an AI chatbot train on my book

From Vox:

Four years ago, I published my first book: End Times: A Brief Guide to the End of the World.

It did … okay? I earned a Q&A with the site you’re reading now — thanks, Dylan! — and the book eventually helped get me the job of running Future Perfect. I had one day where I went from radio hit to radio hit, trying to explain in five-minute segments to morning DJs from Philadelphia to Phoenix why we should all be more worried about the threat of human extinction and what we could do to prevent it.

But a bestseller it was not. Let’s put it this way — about every six months, I receive a letter from my publisher containing a “non-paying royalty statement,” which is sort of like getting a Christmas card from your parents, only instead of money, it just contains a note telling you how much they’ve spent raising you.

So I’ll admit that I was a bit chuffed when I received an email a couple of months ago from people at aisafety.info, who are aiming to create a centralized hub for explaining questions about AI safety and AI alignment — how to make AI accountable to human goals — to a general audience. To that end, they were building a large language model — with the delightful name “Stampy” — that could act as a chatbot, answering questions people might have about the subject. (The website was just soft launched, while Stampy is still in the prototype stage.) And they were asking permission to use my book End Times, which contains a long chapter on existential risks from AI, as part of the data Stampy would be trained on.

My first thought, like any author’s: Someone has actually read (or at least is aware of the existence of) my book! But then I had a second thought: As a writer, what does it mean to allow a chatbot to be trained on your own work? (And for free, no less.) Was I contributing to a project that could help people better understand a complex and important subject like AI safety? Or was I just speeding along the process of my own obsolescence?

Training days

These are live questions right now, with large language models like ChatGPT becoming more widespread and more capable. As my colleague Sara Morrison reported this summer, a number of class action lawsuits have already been filed against big tech firms like Google and OpenAI on behalf of writers and artists who claim that their work, including entire books, had been used to train chatbots without their permission and without remuneration. In August, a group of prominent novelists — including Game of Thrones author George R.R. Martin, who really has some other deadlines he should attend to — filed suit against ChatGPT maker OpenAI for “systematic theft on a massive scale.”

Such concerns aren’t entirely new — tech companies have long come under fire for harnessing people’s data to improve and perfect their products, often in ways that are far from transparent for the average user. But AI feels different, as attorney Ryan Clarkson, whose law firm is behind some of the class action lawsuits, told Sara. “Up until this point, tech companies have not done what they’re doing now with generative AI, which is to take everyone’s information and feed it into a product that can then contribute to people’s professional obsolescence and totally decimate their privacy in ways previously unimaginable.”

I should note here that what aisafety.info is doing is fundamentally different from the work of companies like Meta or Microsoft. For one thing, they asked me, the author, for permission before using my work. Which was very polite!

Beyond that, aisafety.info is a nonprofit research group, meaning that no one will be making money off the training data provided by my work. (A fact which, I suspect, will not surprise my publisher.) Stampy the chatbot will be an educational tool, and as someone who runs a section at Vox that cares deeply about the risk of powerful AI, I’m largely glad that my work can play some small role in making that bot smarter.

And we desperately need more reliable sources of information about AI risk. “I think the general understanding of AI alignment and safety is very poor,” Robert Miles of aisafety.info told me. “I would say that people care a lot more than they used to, but they don’t know a lot more.”

Chatbots, trained on the right source materials, can be excellent educational tools. An AI tutor can scale itself to the educational level of its student and can be kept up to date with the latest information about the subject. Plus, there’s the pleasant irony of using some of the latest breakthroughs in language models to create an educational tool designed to help people understand the potential danger of the very technology they’re using.

What’s “fair use” for AI?

I think that training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?

The law, unfortunately, is far from clear on this question. As Harvard Law professor and First Amendment expert Rebecca Tushnet explained in an interview published in the Harvard Gazette, digital companies have generally been able to employ concepts of fair use to defend harvesting existing intellectual property. “The internet as we know it today, with Google and image search and Google Books, wouldn’t exist if it weren’t fair use to use these words for an output that was not copying” the original, she said.

One way to consider this is to think about how humans, like myself, write books. When I was researching and writing End Times, I was drawing upon and synthesizing the existing work of hundreds of different authors. Sometimes I would quote them directly, though there are specific rules about how much of an individual work another author can directly quote from under fair use. (The rough rule is 300 words when quoting from a published book, or around 200 words for a briefer article or paper.)

More often, though, what I read and processed in my research rattled around in my brain, combined with other reporting and reasoning, and came out as my own work — my work informed by my own sources. Or, in other words, informed by my own personal training dataset.

Link to the rest at Vox

2 thoughts on “Why I let an AI chatbot train on my book”

  1. What I’m starting to see is that a substantially greater proportion of trade-nonfiction authors (n>45) than trade-fiction authors (n>30) have specifically stated that they do not object to their works being used to train large-language-model-based Enhanced Eliza systems†. Indeed, on these relatively small self-selected samples, the proportions are roughly inverse, within the margin of error (which is potentially misleading because it’s calculated on self-selected samples; a rough sketch of that arithmetic follows this thread).

    This reflects one of the main problems with copyright law: That it’s a strictly binary judgment, both as to what is protected and as to the consequences of “copying.” Copyright law attempts to apply an identical framework to every copyrighted work, pretending that purely factual nuance will successfully distinguish among, say, the film All the President’s Men, the nonfiction book upon which the film was based, and the Watergate Hearings transcripts found at the National Archives. The point is that there isn’t a principled way to do so, which is inevitably going to lead to name-calling, shirt-rending, irrational appeals to different aspects of “progress”… and the triumph of the superior original position.

    † What is inaccurately referred to as “generative AI” — inaccurate because it’s nongenerative in that it responds only to a specific request for output, and not AI because it cannot reason from a dataset to a conclusion not implicit in that dataset without a leading question/prompt.

    • FWIW, I would refer to the tech as a whole as Automated Data(set) Analysis, because that more accurately describes what the software does and because the acronym has been wasted on a dead end.
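
A minimal sketch, in Python, of the margin-of-error point raised in the first comment. The counts below are purely hypothetical, since the comment gives only lower bounds on the sample sizes (n>45 nonfiction, n>30 fiction), not actual survey figures:

    import math

    # Hypothetical counts, for illustration only -- the comment above gives
    # just lower bounds on sample size, not the actual figures.
    nonfic_n, nonfic_ok = 46, 30   # nonfiction authors not objecting to training use
    fic_n, fic_ok = 31, 11         # fiction authors not objecting

    p1 = nonfic_ok / nonfic_n
    p2 = fic_ok / fic_n

    # Standard error of the difference between two independent proportions,
    # and a rough 95% margin of error (1.96 * SE).
    se = math.sqrt(p1 * (1 - p1) / nonfic_n + p2 * (1 - p2) / fic_n)
    margin = 1.96 * se

    print(f"nonfiction: {p1:.0%}   fiction: {p2:.0%}")
    print(f"difference: {p1 - p2:+.0%}  +/- {margin:.0%}")
    # At these sample sizes the margin is on the order of 20 percentage points,
    # and the usual formula assumes random sampling, which self-selected
    # samples do not provide -- hence the caveat in the comment.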
