AI Training Permission

From Hugh Howey:

A comment on my previous post about not using AI in my stories — and using the copyright page to make this explicit — is worth responding to in its own post, because I think it raises important issues.

The comment comes from Pat, who says:

I would think a better use of the Copyright would be to declare that no AI could be TRAINED on the copyrighted work. AI has no originality, it can only take in large quantities of material and try to splice it back together in a (usually) coherent manner. Declaring your works off-limits for AI to use as training material means AI will never be able to create “in the style of Hugh Howey” and limits the range of things AI can learn. If enough creative people do this, AI can’t learn from anything and won’t be able to create anything, at least outside places like Adobe where they own a zillion images copyrighted to themselves so they can do whatever they want with them.

Pat Augustine

I respect this opinion, and it is all very well-said, but I disagree with most of it and I’d love to explain why.

The idea that AI can be halted in its tracks if we prevent it from learning on copyrighted works misses the fact that there are more than enough works in the public domain to train LLMs.

Even if this weren’t so, I want AI trained on my work. I have a very positive view of AI. These models are, in a way, a distillation of our combined intelligence, our thoughts, our wisdom, our unique writing voices. I love being a part of that. I love that we are all contributing to it and building something that will certainly outlast us individually and may very well outlast us collectively.

When humans are extinct, our sun an old tired red giant, and what’s left of us is cruising among the stars, I like to think that some tiny sliver of me is out there intermingling with some tiny sliver of you. Even these words I’m typing right now. We are creating something very special, almost like a child of our every mind, and I think that’s amazing.

Also, guess what? You don’t have a choice. Legally. 70 years after you die, your works will become part of the public domain. The idea that AI is never allowed to be trained on your data is just wrong. It’s a matter of when. If you want to delay it as long as possible, awesome! Go for it. Just know that it’s a temporary thing.

The last thing I disagree with here (and the most important) is the claim that LLMs can’t be creative. I’ve played with LLMs enough to say this with complete confidence: what they do is similar enough to what we do that it’s a question of degree and not kind. If they aren’t creative, then we aren’t creative, and the word has no meaning. Today’s most advanced LLMs are definitely creative, and astoundingly so. They can generate new ideas never seen before. They aren’t just rearranging what’s already out there, they are “thinking” in much the same way that we “think.”

Link to the rest at Hugh Howey

PG says that Hugh is thinking quite clearly and rationally about AI.

The fact is that AI writing (and art, legal writing and a zillion other AI applications) is here to stay and will become more sophisticated over time. That said, PG predicts that quality authors and other creative professionals will continue to create unique and original work that will find an audience willing to pay to experience the benefits of a creative human mind.

Here’s a link to Hugh Howey’s books.

15 thoughts on “AI Training Permission”

  1. Whenever I hear AI criticized for its lack of original thought, I have to ask, “What is original thought?” Who thinks it? Humans?

    What thoughts do we have that are not splicing back together bits and pieces of what we already have learned and observed? That arrangement of those bits and pieces may be original. If I do it, or if AI does it, it is still an original arrangement and an original thought. I suspect we overrate ourselves.

    Just for fun, try to imagine something that does not have a basis in something you already know. If you do, tell us what it is.

  2. The way I take it is that the internet is, at the end of the day, for better or worse, the sum total of *accessible* human knowledge. What else would you train a tool meant to amplify human mental activity on?

    If you are part of humanity and are contributing anything to the future, you *want* to chip in your grain of salt. So I can understand Mr Howey’s position.

    Besides, if you didn’t want your “precious” to be accessible to web crawlers, why’d you put it online in the first place? No paywall, no robots.txt fence, nothing?

    The gripers also have a problem with scale: the companies doing the training could have (and might have) *bought* a copy of every ebook commercially available for what is to them sofa-cushion change. (Think of it: Amazon carries, what, 4 million ebooks? At an average of $10, that would add up to $40 million. That’s not even a rounding error on a $10B investment. For context: Microsoft is buying Activision for $68.7B. The media refers to it routinely as a $70B merger, because what’s another billion or so at that level?)

    And yet again, on a tool trained on 100T discrete pieces of information, can anybody point to any single piece and prove to a certainty that it was accessed and that they were harmed by it? Do they even have standing to argue, given that the gripers already surrendered control of their copyright to the glass towers? Wouldn’t it fall to *them* to prove anything? (They already tried once and failed on a smaller project.)

    All the knee-jerk reactions every time a new tech emerges are getting tiresome.
    They should first learn which world and year they live in.

  3. Humans evolved over millions of years through evolutionary selection.

    We think we have free will and choice. But just because we can’t predict our future choices any faster than the time it takes to act on them doesn’t mean that everything we do isn’t already determined by the initial states that led to our creation.

    LLMs were created through technology. An argument can be made that LLMs are not intelligent, as they’re just code. This argument can also be used to dismiss human intelligence (Searle’s Chinese Room thought experiment).

    This leads me to the conclusion (for some definitions of consciousness) that either both we and LLMs are conscious, self-aware, and intelligent; or that neither LLMs nor we truly are.

    My fear (though I don’t need any reassurance on the matter) is that LLMs prove that humans are just LLMs too (but, I’m just an SF author so, what do I know?).

    • The difference between a human and an LLM:

      Humans use language as a way of symbolically manipulating referents. When we use the word ‘hand’, we know what a hand is; most of us have a couple of them ourselves. We know how to work them, what they are good for, what they are not, how a left hand differs from a right hand. All that nonverbal information is accessible to us, and the word is a label for it. One degree more removed, we have direct sensory knowledge of referents that are not part of ourselves. We see that growing grass is green, but sometimes it turns brown; we abstract the quality green from the object grass, and give names to them both, and this enables us to consider them separately and to combine them in interesting ways with other concepts.

      LLMs use language in and of itself. They have no access to referents, and strictly speaking, no concepts at all. Hand, grass, and green are merely words to them, which occur in combination with other words in certain discoverable patterns, but have (to them) no meaning apart from that – no extrinsic qualities at all. Their universe is the universe of language, and the physical world for which language provides the labels does not even impinge on them, except as the medium in which their hardware operates and therefore the limit to their computational power.

      Sarah Hoyt once used an AI image generator to create an illustration for a blog post. It was a picture of a man, looking much like a stylized worker from a Soviet propaganda poster, holding some kind of long-handled tool over his head in both hands. Except that both his hands were left hands, and the tool was a ‘widget’ straight out of M. C. Escher. You see, the software was not built to model its subjects in 3-D space as if they were real, and had no conception that these things were wrong. It merely assembled bits and pieces of other flat images according to the metadata that accompanied them.

      Margaret Halsey once said, ‘Englishwomen’s shoes look as if they had been made by someone who had often heard shoes described but had never seen any.’ To an LLM (and its image-generating cousins), the entire universe is something that it has often heard described but never seen. This places a hard limit on what it can accomplish; and that limit is well short of understanding, modelling, or thinking about the world in the proper sense of these words.

      I have said before that if you think of an LLM as an enormously well-read parrot, you will not be far from the mark.

      • “LLMs use language in and of itself. They have no access to referents, and strictly speaking, no concepts at all. Hand, grass, and green are merely words to them, which occur in combination with other words in certain discoverable patterns, but have (to them) no meaning apart from that – no extrinsic qualities at all.”

        Straight to the point. LLMs are just general purpose symbol manipulation software, just like spreadsheets are number manipulation software.

        There are other types of “AI” that do operate on the referents and the meaning (for certain forms of meaning) of those referents. But those are niche applications, the equivalent of savants, that are limited to processing a very specific class of data (say protein topology, or orbital radar data, or materials science mixtures). None of those functions would be confused with “thinking” or intelligence yet that is where the biggest value of machine learning lies.

        No need to panic.
        Software is no more going to replace humanity than assembly line robots did.

  4. The key point is that Mr Howey is perfectly entitled to make a decision, for himself, that he’s just fine with generative AI engines training using his prose (in whole or in part). Simultaneously, my late client Mr Ellison would have been perfectly entitled to make a decision,† for himself, that he’s vehemently opposed to generative AI engines training using his prose (in whole or in part).

    • Nobody should be fine with Mr Howey granting that permission for Mr Ellison
    • Nobody should be fine with Mr Ellison barring that permission for Mr Howey
    • Nobody should be fine with generative AI engines — or, more to the point, the people directing them (notwithstanding any business-entity or research-institution BS) — not even asking permission… even if that’s expensive or administratively inconvenient

    Perhaps to Marines, it’s easier to ask forgiveness than get permission. Leaving aside the dubious wisdom of modelling one’s social relations on the Marines — although that’s a far from irrelevant consideration — there’s the further question of exactly what portion of a writer’s work is appropriate to train the generative AI engine. One wonders how well the novel-writing machines would have fared using as source material the writings that George Orwell first published when he “was still” Eric Blair…

    Plenty of entitlement to go around here… especially when imposing one’s own concept of “writing” on someone else. Back to your lives, citizens.

    † The potential was discussed a couple of decades ago; we could both see it coming, at an uncertain time. Hell, we’d both written about it — Mr Ellison in fiction, me in scholarly forms. Further counsel sayeth not.

    • You’re assuming, just like the gripers, that permission is required.
      Where does it say it is needed?
      All existing evidence runs counter to that assumption.
      Web crawlers are established internet practice, with established rules for fencing off material not to be crawled.

      All precedent is that “permission” for crawlers is opt-out, not opt-in, and a mechanism already exists for opting out. Just because authors don’t like what *their publishers* do or don’t do doesn’t give them much basis to be making retroactive claims on something that has been established practice for 30 years, since the days of ALTAVISTA.
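      For what it’s worth, that opt-out fence is just a plain text file a site owner publishes. A minimal sketch of what one could look like (the AI-bot name here is hypothetical; each real crawler documents the user-agent token it honors):

```text
# robots.txt, served from the site root (e.g. https://example.com/robots.txt)

# Fence off the whole site from one (hypothetical) AI-training crawler:
User-agent: ExampleAIBot
Disallow: /

# Let every other crawler in, except for a private area:
User-agent: *
Disallow: /private/
```

      Compliance is voluntary, which is part of the gripe: robots.txt is a convention crawlers agree to honor, not an enforcement mechanism.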


      Note that while most crawlers are used for search engines and internet indexing, not all are. Remember DATA GUY and Author Earnings? How’d he get his data? Via spider. Funded by, ahem, Mr Howey.
      It would be hypocritical if he objected to other people’s crawlers after unleashing one to gain insight into other people’s business. He doesn’t seem to.

      Note this passage at Cloudflare (whose business is website security, BTW):

      “Robots.txt requirements: Web crawlers also decide which pages to crawl based on the robots.txt protocol (also known as the robots exclusion protocol). Before crawling a webpage, they will check the robots.txt file hosted by that page’s web server. A robots.txt file is a text file that specifies the rules for any bots accessing the hosted website or application. These rules define which pages the bots can crawl, and which links they can follow. As an example, check out the robots.txt file.”
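      The protocol that passage describes can be exercised straight from Python’s standard library; a minimal sketch (the rules and bot names are made up for illustration, not any real site’s policy):

```python
# Sketch: how a well-behaved crawler consults robots.txt before fetching,
# using the stdlib robots.txt parser. Rules and bot names are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/

User-agent: ExampleAIBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A generic crawler may fetch public pages but not /private/ ...
print(parser.can_fetch("GenericBot", "https://example.com/posts/1"))    # True
print(parser.can_fetch("GenericBot", "https://example.com/private/x"))  # False
# ...while the hypothetical AI trainer is fenced off entirely.
print(parser.can_fetch("ExampleAIBot", "https://example.com/posts/1"))  # False
```

      In practice a crawler fetches the site’s real robots.txt (via `RobotFileParser.set_url()` and `read()`) instead of parsing a string, but the allow/deny logic is the same.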

      As to “the irascible Harlan Ellison,” I have no doubt he’d be railing against this well-established form of fair use, but fortunately it didn’t exist in his time, so he was spared the heartburn. But that was then, this is now, and there are 30 years of precedent supporting online fair use.

      • Here is an entire discussion on web crawlers:


        “Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping and data-driven programming.”

        And data-driven programming, like the Author Earnings spider, is exactly how LLMs are trained. There is nothing new or different about the practice, just the scale.

        The rest of this particular link is purely educational for those interested in how sausage is made. 😉
        The plumbing that makes civilization run is…complicated.
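        Since that plumbing keeps coming up, here is a minimal stdlib-only sketch of the scraping core of a spider: pull the hyperlinks out of a fetched page so the crawl can continue (or the data be recorded). The page content is a made-up sample; a polite crawler would also check robots.txt and throttle itself.

```python
# Sketch of a spider's link-extraction step, using only the stdlib.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical fetched page; a real spider would download this over HTTP.
page = '<html><body><a href="/book/1">One</a> <a href="/book/2">Two</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/book/1', '/book/2']
```

        A full crawler just repeats this in a loop: fetch a page, extract its links, queue the ones it hasn’t seen, record whatever data it came for.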

      • But in the case of Mr Ellison, it would have been required: There were no legitimate, legal examples of his writing on the web for a crawler to access. All examples of his work actually available on the web were either pay-to-download electronic books with his authorization, or pirate scans of printed books.

        Again, though, this comes down to the difference between the way the human mind works (analysis without actually fixing† a copy) and the way a von Neumann architecture computer works (analysis can be performed only upon a local fixed copy, even if only in local memory). It is possible that quantum computing and other non-von-Neumann architectures that I can’t describe, or really imagine in any more than hand-waving prose, will no longer require a local copy for analysis. Until then: They are completing a copyright infringement by the act of making that local copy (notwithstanding the built-up tradition surrounding web crawlers, which is very much like the tradition of high-school students always partying on beach X when beach X is private property and there’s no California Coastal Commission providing a legal public right of access to that beach).

        It’s all a messy “implied license” thing. And it neglects something else:

        Don’t treat “permission” as solely restricted to “permission required by law.” I’m adding in the concept of “let’s have just a tiny, tiny bit of common courtesy in here, too, notwithstanding the general lack thereof on social media.” So we’re not even talking entirely about the same thing to start with.

        tl;dr I’m afraid this rests on a counterfactual: Certainly as to Mr Ellison, permission is both legally and ethically required.

        † “Fix” is a term of art in copyright law. It includes any complete, literal copy of a copyrightable element or work that can be further copied without damage or further access to the original. (And that’s the least-technical explanation I can offer.) There is plenty of both precedent and real-world exemplars of “temporary copies in memory” (or even “in a processor’s internal register”) being understood as “fixed,” and virtually nothing to the contrary. Whether that’s the way it should be in the best of all possible worlds is irrelevant; this is not the best of all possible worlds, as epitomized by the very existence of VARA (§ 106A of the Copyright Act). If you want to spot some special snowflakes to pick on, I suggest Jeff Koons et al.

    • Show of hands.

      Given a choice of only two options, would you model your social relations on Marines or on lawyers?

  5. I try to look for hidden assumptions and a lot of the griping over LLMs seems to be assuming that “somebody is making money and it’s not me”.

    False assumption:

    “OpenAI spends about $700,000 a day, just to keep ChatGPT going. The cost does not include other AI products like GPT-4 and DALL-E2. Right now, it is pulling through only because of Microsoft’s $10 billion funding”

    Worth a read.
    TLDR? There may not be any money to sue for.

    In a way, the current wave of LLMs reminds me of Mosaic, the first browser, and NETSCAPE, the first commercial browser (built on stolen code from NCSA, BTW).

    LLMs are not commercial products themselves but tools for creating products.
    (OpenAI makes their money off DALL-E and API and ChatGPT subscriptions, not the free chatbot most people use.)

    It is early days, too early for outrage over commercialization, when the target of the angst may be gone and forgotten in a year. By the time any cases get tossed, the world will have moved on to the next development tool.

    A bit of history may be a good guide: Microsoft was one of the earliest legal licensors of Mosaic and used it as a starting point for Internet Explorer, which superseded both Mosaic and Netscape. The real money was in the web sites and apps – as MICROSOFT argued in court – not in the browser, as Netscape pretended. As before, so now.
    Just as it did back then, MS has licensed the pioneer’s code and models to *supplement* and turbocharge its in-house tech that is going into the actual revenue-generating products. What worked then is working now.

    There is no money in training models but there is big money in the apps the models help create.

Comments are closed.