
Google Is Using Romance Novels To Build Artificial Intelligence

30 September 2016

From The Guardian:

When the writer Rebecca Forster first heard how Google was using her work, it felt like she was trapped in a science fiction novel.

“Is this any different than someone using one of my books to start a fire? I have no idea,” she says. “I have no idea what their objective is. Certainly it is not to bring me readers.”

After a 25-year writing career, during which she has published 29 novels ranging from contemporary romance to police procedurals, the first instalment of her Josie Bates series, Hostile Witness, has found a new reader: Google’s artificial intelligence.

“My imagination just didn’t go as far as it being used for something like this,” Forster says. “Perhaps that’s my failure.”

Forster’s thriller is just one of 11,000 novels that researchers including Oriol Vinyals and Andrew M Dai at Google Brain have been using to improve the technology giant’s conversational style. After feeding these books into a neural network, the system was able to generate fluent, natural-sounding sentences. According to a Google spokesman – who didn’t want to be named – products such as the Google app will be “much more useful if they can capture the nuance of language better”.

. . . .

“We could have used many different sets of data for this kind of training, and we have used many different ones for different research projects,” he adds. “But in this case, it was particularly useful to have language that frequently repeated the same ideas, so the model could learn many ways to say the same thing – the language, phrasing and grammar in fiction books tends to be much more varied and rich than in most nonfiction books.”

The only problem is that they didn’t ask. The Google paper says that the novels used in this research were taken from “the Books Corpus”, citing a 2015 paper by Ryan Kiros and others which describes how the authors “collected a corpus of 11,038 books from the web”, describing them as “free books written by [as] yet unpublished authors”. It’s a collection that has been used by other researchers working in artificial intelligence and which is currently available for download in its entirety from the University of Toronto.

Forster says that she “always appreciates an interesting use of words”, but while Hostile Witness is available to download for free, no one asked her permission to use her novel as raw material to train a computer.

“Perhaps I’m still thinking in the old way, that a reader will read my book – it didn’t even occur to me that a machine could read my book. What I found curious was that these were referred to as ‘free books written by as yet unpublished authors’ because my state is very different,” she says.

Link to the rest at The Guardian


21 Comments to “Google Is Using Romance Novels To Build Artificial Intelligence”

  1. So their AI will want to romance you instead of providing the info you were searching for? Hmmm, could they be trying to make Bing look better? 😛

  2. This is research. Just research.

  3. I don’t understand why the quoted author is upset she wasn’t asked for her permission. The researchers acquired the book through a legitimate channel. They aren’t disseminating it. This seems functionally equivalent to an ESL student buying a book for exposure to natural language.

    • She never heard of Fair Use?
      It seems a common failing among the AG/AU crowd.

      • Another reason to add to the ones before – I plan to never have Pride’s Children free (see blog post, type “odd reason” with quotes into search box if curious).

        ‘Free’ is temporary a lot of the time, but once they’ve examined her dialogue use (I imagine that’s mostly what they’re looking at), they won’t go back and UNuse it if she puts the books up for sale.

        It just feels creepy – and are they aware that ‘free’ books are often (not always) not quite finished? My first draft dialogue was horrible. Got the points I was trying to make into the story, but needed severe revision – to sound human.

        • I’m not clear on your concern here:

          ‘Free’ is temporary a lot of the time, but once they’ve examined her dialogue use (I imagine that’s mostly what they’re looking at), they won’t go back and UNuse it if she puts the books up for sale

          If I ‘buy’ one of your books during a temporary free period, do you expect me to refrain from rereading the copy in my possession after the price has gone up? Am I barred from remembering characters, plot, dialogue, lovely turns of phrase after the price has gone up?

          • Just commenting that they claimed to read free online books – which would come under fair use, I imagine.

            But online books, whole novels, are not necessarily that way forever – and this is a use she didn’t approve (mining her dialogue for humanness) and could potentially sell.

            You buying a copy of a book entitles you to read it as many times as you like. If it’s paper, you can pass it on or sell it. The ebook, however, is a license to read – not ownership you can (currently) pass on or sell.

            I don’t know – probably nobody does yet, and it would be a matter for the courts to adjudicate (way over my legal knowledge here) – whether there are any rights to the use of your text for purposes other than a human reading it.

            It just opens more areas to ‘who can use how much of your text for what purposes.’

            • […]this is a use she didn’t approve (mining her dialogue for humanness) and could potentially sell.

              That’s an interesting point. Is reading a book for machine training purposes a separable right from reading a book for pleasure or education or titillation? You certainly can and should separate print rights, ebook rights, audio rights, TV and film adaptation rights, translation rights, even regional rights if you want, but separating out why a book is read… Interesting question.

              • Nobody thought about ebook rights 100 years ago, or even movie rights 150 years ago – and those are worth big money now.

                Kris Rusch recommends licensing only specific rights that the licensee can use right away profitably.

                Makes sense, when technology has accelerated so fast over my lifetime.

                If there’s money in it, someone will attempt to profit. Might as well be the creator.

                • What if someone reads your book–and gets an idea for something else?

                  For instance, say there’s a certain character that if you take their plot arc and twist in another direction–the story becomes something entirely new.

                  Do you think that is something authors should have the right to protect their works from also?

                  Copyright protects your words in the order you placed them in. It doesn’t protect your work from ideas people may get from reading it.

      • But is it fair use? From this, “It’s a collection that has been used by other researchers working in artificial intelligence and which is currently available for download in its entirety from the University of Toronto,” I get the impression that the book Hostile Witness has been bundled with other books and is being redistributed. I don’t really think of that as fair use.

        I’d have a totally different take on this if it were a list of books, and the researchers then had to go out and get a copy, free or not.

    • Agree. If I read her novel and learn something from it will the author be upset about that?

      Dan

  4. I don’t believe authors (myself included) have any right to determine how our books are used.
    I’d prefer to be read, of course. But, if someone wants to use my books to keep warm (by burning them) I’m not gonna be too happy, but at least my books are being used.
    Likewise, using my books for a research project (Google AI) sounds like something I’d be very happy to contribute to.

  5. It sounds like The Books Corpus is on legally shaky ground and the University of Toronto is possibly distributing books without the right to do so. That’s a separate issue from using a purchased book to train software. Is that Fair Use? Interesting question. Is it using the text for research? Using it for commercial purpose? A form of republication if the AI retains text in memory?

    • Didn’t find much about ‘book corpus’ online, but it looks like it may be a collection of uploaded books provided by Wattpad? In which case, it’s probably perfectly within their terms and conditions.

    • Using books for input to machine learning, which is what upset her, is definitely fair use. Ten years of AG litigation against Google established that replacing a human with a machine does not make the act (quoting, reading, etc) suddenly illegal.

  6. Martin L. Shoemaker

    Would she be equally concerned if someone used the book in an adult literacy program without her permission? If someone read it to a lover?

    She released it into the world. She can control reproduction of it, but not other uses.

  7. Re: an article in March, Stanford uses “the Wattpad corpus” to teach its AI programs about the world. With permission from Wattpad no less.

    If it is up on the Internet for free download and reading, both people and machines are downloading it and reading it. That is the whole point of the Internet.

  8. First off, the Toronto Books Corpus appears to have been taken down.

    Second, it would appear that natural language training papers have been using all sorts of ad hoc corpora for years. For example, there’s a 2014 paper using a “Blog corpus” of 641,288 blog posts.

    Third, many nations explicitly define scientific or technical experimental uses for texts as fair use, or as permitted under other grounds. (Other educational uses are sometimes included.)

    Fourth, the Google Books corpora are what people normally use; this study and the Stanford study are different because they attempt to use only recent texts and only books that are fiction.

  9. First off, the Toronto Books Corpus appears to have been taken down as of today. But the idea was to take freely available books and turn them into a very long database of “ordered sentences”, sentences connected to each other and making sense. The fact that every sentence was a natural language sentence produced by a human was what was valuable. Paragraphing and other formatting was apparently indicated in the dataset, but different researchers could use the sentence database in different ways.

    Second, it would appear that natural language training papers have been using all sorts of ad hoc corpora for years. For example, there’s a 2014 paper using a “Blog corpus” of 641,288 blog posts. If you use a corpus in a paper, you pretty much have to keep a copy of the data, from that moment you used it; and you have to make it downloadable or otherwise obtainable by others, so that they can check your work and see if it can be duplicated. Otherwise, it’s not scientific.

    Third, many nations explicitly define scientific or technical experimental uses for texts as fair use, or as permitted under other grounds. (Other educational uses are sometimes included.)

    Fourth, the Google Books corpora are what people normally use; this study and the Stanford study are different because they attempt to use only recent texts and only books that are fiction. But I guess everybody will go over to the Wattpad corpus now.

    • Oh, and most Creative Commons licenses include scientific papers as being a “non-commercial use,” even if the scientists work for Google, or if their data is eventually used by a company somewhere.
