AI is coming for your audiobooks. You’re right to be worried.

From The Washington Post:

Something creepy this way comes — and its name is digital narration. Having invaded practically every other sphere of our lives, artificial intelligence (AI) has come for literary listeners. You can now listen to audiobooks voiced by computer-generated versions of professional narrators’ voices. You’re right to feel repulsed.

“Mary,” for instance, a voice created by the engineers at Google, is a generic female; there’s also “Archie,” who sounds British, and “Santiago,” who speaks Spanish, and 40-plus other personas who want to read to you. Apple Books uses the voices of five anonymous professional narrators in what will no doubt be a growing stable: “Madison,” “Jackson” and “Warren,” covering fiction in various genres; and “Helena” and “Mitchell,” taking on nonfiction and self-development.

I have listened to thousands of hours of audiobooks (it’s my job), so perhaps it’s not a surprise that I sense the wrongness of AI voices. Capturing and conveying the meaning and sound of a book is a special skill that requires talent and soul. I can’t imagine “Archie,” for instance, understanding, much less expressing, the depth of character of say, David Copperfield. But here we are at a strange crossroads in the audiobooks world: Major publishers are investing heavily in celebrity narrators — Meryl Streep reading Ann Patchett’s “Tom Lake,” Claire Danes reading “The Handmaid’s Tale,” a full cast of Hollywood actors (Ben Stiller, Julianne Moore, Don Cheadle and more) on “Lincoln in the Bardo,” to name a few. Will we reach a point where we must choose between Meryl Streep and a bot?

The main issue is, naturally, money. The use of disembodied entities saves time and spares audiobook producers the problems of dealing with human beings — chief among them, their desire to be paid. This may explain why so many self-published books are narrated by “Madison” and her squad of readers. Audible insists that every audiobook it sells must have been narrated by a human. (Audible is a subsidiary of Amazon, whose founder, Jeff Bezos, owns The Washington Post.) Major publishing houses say the same. But how long until they see the economic benefits of AI?

Jason Culp, an actor and award-winning narrator who has been recording audiobooks for more than a quarter of a century, knows how much goes into a production. A 10-hour audiobook, he says, takes a narrator something like four or five days, with a couple of additional hours for editing mop-up. For each finished hour of audio, narrators make about $225 — somewhat more for the big names — and editors, about $100. Beyond that, producers must pay a percentage to SAG-AFTRA, the narrators’ union. There are other production costs too, of course, but you can see how eliminating the human narrator appeals to the business mind.

Apple’s narrators are cloned from the voices of professionals who have licensed the rights to their voices. Their identities are secret, but speculation abounds. It’s a touchy subject, and you can see why. Whether to sell the rights to one’s voice is an agonizing decision for a professional narrator. The money offered amounts to something like what a midrange narrator makes in four years; on the other hand, agreeing to the deal seems to many to be a betrayal of the profession, one that would risk alienating one’s peers.

According to Culp, narrators are alarmed by the advent of AI narration “as, naturally, it might mean less work for living, breathing narrators in the future. We might not know the circumstances under which a narrator might take this step, but generally there is a lot of solidarity within the community about encouraging narrators not to do it. As well, our union is keeping a close eye on companies that might be using underhanded tactics to ‘obtain’ narrators’ voices in works that they have produced.”

Even though the notion makes my skin crawl, I listened to Madison’s narration of “The New Neighbor” by Kamaryn Kelsey, the author of almost 60 self-published books (Apple, 1½ hours). This is the first installment in a series of 19 detective stories starring female private investigator Pary Barry. The plot is entertaining enough, and Madison is a slick operator, in the sense that you can believe that she’s human — for about five minutes.

Link to the rest at The Washington Post

PG asks, “When you listen to an audiobook, are you focusing on the performance of the narrator or the book itself? Do you forget about the narrator’s voice after a few pages?”

While the human narrator is certainly capable of creating a better or worse “performance,” the narrator’s first obligation is not to interfere with the listener’s enjoyment of the book.

PG wonders if someone’s appreciation of a particular human performer may be a little like wine-tasting. Some people have a palate that always discriminates between a good and a bad wine; others, unless they have a side-by-side comparison, are fine with the equivalent of a house wine.

PG suggests that a very large portion of the present and future listeners to audiobooks will be perfectly happy with the house wine.

(Note: Although PG has not tasted wine for several decades, he does recall the various business lunch/dinner performances of the sommelier carefully uncorking a bottle, presenting the cork for a sniff test by whichever businessperson was paying for the meal and drinks, pouring a bit into a wineglass for the host to swirl around, sniff, then swallow delicately, look into the air, then communicate approval. On more than one occasion, a host who was also a good friend would admit he had no idea what the difference in taste was between an expensive wine and the house wine. To indicate how long it’s been since PG has witnessed this ceremony, he doesn’t ever recall the presence of a business hostess. No, those were not the good old days for PG. He prefers the present.)

15 thoughts on “AI is coming for your audiobooks. You’re right to be worried.”

  1. I was quite happy with an old generation of Kindles where I could just flip it into audio mode and a tinny voice with no inflection read the book. The letters on a page aren’t inflected either.

  2. You can increase the speed. I listen to most books at 2x. Most narrators have been recording at a slower pace the last few years for better enunciation knowing the average listener increases their speed anyway.

  3. I hate being read to, I really do. I’m a very fast reader, and speaking is so much slower, that I can barely assimilate the information from the start of the text without it expiring from my brain before the sentence ends. Reading and Performance are not the same thing.

    Now, of course, in a stage presentation, the performance is what matters. But for ordinary reading, any sort of automation robs me of the ability to rush forward, back up to see if I missed something, and so forth. Audiobooks just aren’t my modality.

    I did do an experimental “Narrated by Author” production for one of my books (where I had to face my own Welsh pronunciations as a penalty), and the process was educational and fun, but just not cost-effective.

  4. A performance of a book with characters voiced by different actors is always entertaining, but I suppose that’s because it makes the story sound like a radio play.

    Audio books are okay, but I find them too slow, and always wish I were reading them myself. There again I’m a pretty fast reader.

    • They don’t work for me except for some non-fiction where it is like auditing a college lecture. And even there YouTube videos work better for charts and graphs and situational video.

      I’m actually surprised nobody has tried doing video books for that kind of material. Maybe with AI video tools somebody will try it. Cookbooks seem like a natural.

      • What is a video book? I’m trying to imagine this, and it sounds like if a channel like Invicta did a series of episodes on a book. At the link you’ll notice animated graphics and “cartoons” as the narrator is describing the last stand of the Finns in the Winter War. You can see examples in the 50-second intro. But that sort of episode takes skill and time and resources beyond what AI would be able to help with. It would be interesting if an author licensed such a channel to narrate / dramatize their book, but it doesn’t seem likely. With e-books, I remember Amazon trying enhanced ebooks that included videos and such, but I think they ended that.

        • ebooks are essentially wrapped websites.
          A (non-fiction) video book would be a wrapped YouTube channel with a voiceover discussing the book subject over a slide show/video/graphs/charts… Essentially a canned presentation/lecture.

          The video you presented is a good example of what a chapter would look like.
          Or this:

          https://m.youtube.com/watch?v=RrVkFTKdKPs

          Note how the visual parts track and reinforce the narrative.

          Did you ever work with Windows’ old Movie Maker app, where you stitched together a video out of a series of clips? A video book builder app would be like that: you import a text and it attaches a timeline with markers for each page and you go page by page, highlighting blocks of text to be rendered or linking charts or graphs to be presented while the AI narrator “reads” the page.

          It would not be automatic, not for a book-length project, but faster and way cheaper than a polished YouTube channel. Suitable for indies. Or a YouTube promo for the ebook…

          The tech is starting to jell: SORA is showing the way.

          • Did you ever work with Windows’ old Movie Maker app, where you stitched together a video out of a series of clips?

            Yes, although I think it was photos and audio I was stitching together. What you’re describing:

            you import a text and it attaches a timeline with markers for each page and you go page by page, highlighting blocks of text to be rendered or linking charts or graphs to be presented while the AI narrator “reads” the page

            … already exists with Final Cut or Adobe Premiere; even the AI narrator would just be an audio track on the timeline (sequence). Though I haven’t tested it, I gather DaVinci Resolve is a free option akin to Final Cut and Premiere, which would get around the cost issue.

            But I’m thinking in terms of “assets”: the images, the animations, the AI narrators would all be assets imported into the video and placed on different sequence tracks. What I would want from Sora — whose videos need an audio track** — is a way to make those prompts into reusable assets. In the video of the woman swanning around the streets of Tokyo, could the woman be made into a character asset? Like if she were a character created in a video game toolset? Could the Tokyo-at-night setting be an asset?

            The assets are the sticking point. If the “video book” program had stock assets and customizable assets, that would be useful. After all, if a medieval historian wanted their book dramatized, they would need to explain cloth-of-gold, samite, dagged sleeves, farthingales, etc. But if the video book came pre-stocked with “medieval assets” (people, clothing, sets, items, etc., like the asset packs you get for video game creation), that would give the medieval history writer an easy entry into making the video book.

            **the head of the digital / video department at my old paper would remind the videographers that readers get confused when they see video but don’t hear audio. Always include a generic audio clip at least, so readers don’t complain that there’s something wrong with the sound: they expect audio and visual 🙂

            • Hah.
              Adobe has your back:

              https://www.msn.com/en-us/news/technology/adobe-s-latest-ai-experiment-generates-music-from-text/ar-BB1jbxsd?cvid=fb4bb475735d4f619249775edfcec701&ei=56

              AI fine-grained music generation.

              As to your desire for reusable assets, join the crowd.

              It is coming…step by step.
              The most recent feature lets ChatGPT remember prompts and keep them in “mind”. Still a long way from reusable assets but we’re still at the CLI stage of development. What we really need is the equivalent of reusable software objects. At the current pace of evolution, a year or two. Three to be safe. And yes, stock assets will be a tidy business for some.

              For now, I’m thinking book trailers and video ads are a good start for SORA. Later, things will get…interesting…

              The key thing is knowing where things are trending and the trend is AI everywhere, getting cheaper and more focused on real world uses.

  5. Now, where I do think AI is likely to make welcome inroads is AI translations of anime and video games. The American localizers have become notorious for their contempt of the source materials they’re supposed to be dubbing / subbing. They will completely change what the characters are saying, to reflect their own dumb ideas. And by “change” I don’t mean using synonyms, I mean they will completely alter what characters say, in plot-breaking ways.

    So I notice a lot of anime fans cheering the imminent unemployment of localizers in favor of AI. But that’s an own-goal, because they might have kept their jobs had they actually bothered to do them. But along comes AI, doing the job the humans wouldn’t: giving fans faithful translations of their favorite works.

  6. I like narrators who can do the appropriate inflections. If a text was written in italics, I expect the narrator to put emphasis on that word. If a character is using “air quotes,” I expect the narrator to sound sarcastic. And I want their voice to change appropriately if a character is exclaiming, speaking sotto voce, or deadpan.

    So an AI narrator might seem suitable for most kinds of non-fiction, like cookbooks or Ikea instructions*** maybe. But I wouldn’t use it if a subtle or warm touch is required. I don’t want my history narrated by Ben Stein, whether the living man or the AI equivalent. And of course, AI narrators will not be ready for prime time until they can handle fantasy / sci-fi.

    ***I heard those are terrible, but the only Ikea furniture I ever assembled was the Billy bookcase, after first putting a fabric overlay on the interior back panel to jazz it up. And you don’t need instructions for the Billy case, so I didn’t read them.

  7. I’m probably in the minority, but I don’t care for AI narration. I listen to about 40 audiobooks a year and have since the late 1990s. My favorite narrator is the late Edward Herrmann. I recently heard a sample of AI using his voice and could immediately tell it was AI. The nuance he used had disappeared. AI won’t keep me from listening to an audiobook for non-fiction, but I’m not sure I’d listen to a novel narrated by AI.
