Synthetic Voices Want to Take Over Audiobooks

From Wired:

WHEN VOICE ACTOR Heath Miller sits down in his boatshed-turned-home studio in Maine to record a new audiobook narration, he has already read the text through carefully at least once. To deliver his best performance, he takes notes on each character and any hints of how they should sound. Over the past two years, audiobook roles, like narrating popular fantasy series He Who Fights With Monsters, have become Miller’s main source of work. But in December he briefly turned online detective after he saw a tweet from UK sci-fi author Jon Richter disclosing that his latest audiobook had no need for the kind of artistry Miller offers: It was narrated by a synthetic voice.

Richter’s book listing on Amazon’s Audible credited that voice as “Nicholas Smith” without disclosing that it wasn’t human. To Miller’s surprise, he found that “Smith” voiced a total of around half a dozen titles on the site from multiple publishers, breaching Audible rules that say audiobooks “must be narrated by a human.” Although “Smith” sounded more expressive than a typical synthetic voice, to Miller’s ear it was plainly artificial and offered a worse experience than a human narrator. It made giveaway mistakes, like pronouncing Covid as “kah-viid” when referring to the pandemic.

Miller tracked down “Smith”—the voice matched a sample posted to SoundCloud by Speechki, a San Francisco startup that offers more than 300 synthetic voices for audiobook publishing across 77 dialects and languages. He and other narrators and audio fans who discussed the artificial audiobooks online reported the titles to Audible, which eventually removed them. Although it wasn’t a large number, discovering that synthetic voices were good enough for some publishers to put them to work prompted Miller to wonder about the future of his art and income. “It’s a little terrifying because it’s my livelihood and that of many people I respect,” he says.

Richter says he chose an artificial voice because the concept and its “uncanny valley” sound suited his book, which has a piece of intelligence software as one of its main characters, and that he was unaware of Audible’s policies. “My intention was never to upset or offend anyone,” he says. Speechki says it recommends publishers identify that narrations are synthetic and that it informs them of Audible’s policies. Will Farrell-Green, a senior director at Audible, said in an emailed statement that the company uses automated and manual processes to enforce its rules but that “due to the volume of content on our service, titles that are not compliant do slip through from time to time.” Audible’s “human’s only” policy dates back to at least 2014, when synthetic voices were much less convincing, and the company has said the rule helps provide listeners the performances they expect.

Synthetic voices have become less grating in recent years, in part due to artificial intelligence research by companies such as Google and Amazon, which compete to offer virtual assistants and cloud services with smoother artificial tones. Those advances have also been used to make reality-spoofing “deepfakes.” Speechki is one of several startups developing speech synthesis for audiobooks. It analyzes text with in-house software to mark up how to inflect different words, voices it with technology adapted from cloud providers including Amazon, Microsoft, and Google, and employs proof listeners who check for mistakes. Google is testing its own “auto-narration” service that publishers can use to generate English audiobooks for free, using more than 20 different synthetic voices. Audiobooks published through the program include an academic history of theater and a novelist’s exploration of cultural attitudes to sex. Google spokesperson Dan Jackson says its auto-narrated books supplement rather than replace professionally narrated books. “Our goal with auto-narration is to make it possible to create a low-cost audiobook for any ebook title and increase content accessibility for those that are unable to read via ebook,” he says.

Link to the rest at Wired

Here’s a sample of a synthetic voice from Speechki that was embedded in the OP.

Per the Speechki website, their software can produce an audiobook in 15 minutes.

This page features an audiobook sample in Spanish.

19 thoughts on “Synthetic Voices Want to Take Over Audiobooks”

  1. My opinion is that competition is a good thing. Many authors, I believe, find it hard to shell out between $200 and $1,000 per finished hour, when an AI voice could do the same thing, and for nothing. It takes many sales to make up the cost–likely years in terms of time. Yes, there will be some differences, especially when trying to capture emotion, but even that may clear up in the near future.

  2. Listening to the clip was interesting. The voice does have a tinny sound to it, but I don’t know if I would have twigged that it was artificial if I came to it cold. There are many artificial audiobooks on YouTube that sounded worse.

    • Specialized news channels on YouTube use similar tech, and some sound good enough that you can only tell by the pauses.
      They don’t make anything of it; it’s just another form of presentation.

      The tech isn’t there yet but at some point TTS engines will evolve to do the job on the fly in a tablet off a straight ebook. Just a matter of time.

      It is probably doable today using “AI” and an XML format to mark up the text for voice, intonation, and pronunciation. The “reader” could then render the encoded phonemes as natural-sounding speech.

      Amazon probably won’t do it because it would put Audible out of business.
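
      An XML markup of this kind already exists as the W3C’s Speech Synthesis Markup Language (SSML), whose `speak`, `prosody`, and `phoneme` elements carry pacing and pronunciation hints. The following is only a minimal sketch of the idea: the helper name `build_ssml`, the sample sentence, and the IPA transcription of “Covid” (the word the article says the synthetic narrator got wrong) are assumptions for illustration, not anything Speechki or Audible is known to use.

```python
import xml.etree.ElementTree as ET


def build_ssml(before, word, ipa, after):
    """Wrap one word in an SSML <phoneme> hint inside a <prosody> span.

    Hypothetical helper: the element and attribute names follow the SSML
    standard, but the function itself is illustrative only.
    """
    speak = ET.Element("speak")
    # <prosody> sets the pacing for the whole sentence.
    prosody = ET.SubElement(speak, "prosody", rate="medium")
    prosody.text = before
    # <phoneme> tells the engine exactly how to pronounce the wrapped word.
    phoneme = ET.SubElement(prosody, "phoneme", alphabet="ipa", ph=ipa)
    phoneme.text = word
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")


# Assumed IPA transcription for "Covid"; adjust for the accent you want.
ssml = build_ssml("Then ", "Covid", "ˈkoʊvɪd", " arrived.")
print(ssml)
```

      A TTS engine that accepts SSML would read the marked-up sentence with the supplied pronunciation instead of guessing, which is the kind of fix a proof listener could request.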

  3. DeepZen’s offering a similar service, but the price point is way too expensive still. If they can bring it down to somewhere around minimum wage, they’ll roll in the dough. But I’m not going to pay real money to a machine when there are hungry actors starting out willing to do the job for cheaper.

  4. Keep in mind that the choice isn’t necessarily between a human-voiced audiobook and an AI one. Many times, we authors–while sympathetic to actors–simply can’t afford the price, especially since many readers expect to get audiobooks free from the library or other sources. In many cases, it will be a matter of a book being available in at least halfway decent audio to those who really need it.

  5. From their website: “$8000 is an average cost.” No, not even close. The average is closer to $2,000. The range is broad, from $0 (ACX royalty share) to $50,000 for hiring an A-list actor. A 100,000-word book should read at just under 11 hours. You can find plenty of great narrators for $150 to $250 per finished hour.

    I don’t see this working for fiction but do for non-fiction. Authors considering this will need to listen carefully for mistakes or mispronounced words and ensure the service will fix those. That means authors will need to plan on spending several hours proof listening or hire someone to do that.
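
    The arithmetic behind those figures can be sketched as follows. The pace of 9,300 words per finished hour (about 155 words per minute) is an assumption chosen to match the “just under 11 hours” estimate; the word count and hourly rates come from the comment above. The function names are hypothetical.

```python
# Assumed narration pace (~155 words per minute); not stated in the post.
WORDS_PER_FINISHED_HOUR = 9_300


def finished_hours(word_count, words_per_hour=WORDS_PER_FINISHED_HOUR):
    """Estimate the length of the finished audiobook in hours."""
    return word_count / words_per_hour


def narration_cost(word_count, rate_per_finished_hour):
    """Estimate the narrator's fee at a per-finished-hour rate."""
    return finished_hours(word_count) * rate_per_finished_hour


hours = finished_hours(100_000)
low = narration_cost(100_000, 150)   # low end of the quoted rate range
high = narration_cost(100_000, 250)  # high end of the quoted rate range
print(f"{hours:.1f} finished hours, ${low:,.0f} to ${high:,.0f}")
# prints: 10.8 finished hours, $1,613 to $2,688
```

    Under that assumed pace, the range brackets the “closer to $2,000” average and sits well below the $8,000 figure quoted from the website.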

  6. Older Kindles had this, and some older Kindle tablets do too. I have a Fire 6, which is the model they sold at Christmas one year in a six-pack. It has TTS, although it is slightly hidden at the bottom of a menu. It’s not super great, a bit monotone, and it does miss the occasional word, but it is “good enough” much of the time. For example, I use it with Bluetooth while doing the dishes or on a long drive, since I can pair the tablet with my in-car speaker. If you look at 1-star Kindle reviews you will still see old-timers complaining that TTS is missing in their recent upgrade.

  7. I am also skeptical that this computer voice would work for fiction. The voice in the video would be serviceable for non-fiction and news reading, but the true test is how it would handle fantasy / sci-fi, with their non-standard words. Daenerys, Eilonwy, mithril, an angry Worf speaking Klingon, or a pensive Legolas speaking elvish — let’s see how the computer handles those.

    I still prefer the radio-play style of audiobooks, so I’m probably biased. But I would be impressed if they demonstrated a synthetic voice that could handle those challenges.

    • Wait a while.
      And not too long.

      Whatever Amazon might fret over, the gaming world will make it happen. Either Bethesda, Obsidian, CDPR, or BioWare. RPGs need hundreds/thousands of voices. And games are getting more expensive faster than sales. So cost savings will urge adoption.
      Or EPIC will.

      And once games refine the tech…

      Yes, non-fiction will be first in synthetic audiobook actors–especially news readers–but fiction won’t be far behind. As a W.A.G., I’d bet on LITFIC, romance, cozy mysteries, thrillers, and SF&F last. But let’s remember SF&F aren’t the biggest genres anyway. Way more money in the mundane genres.

      So tech level + market size says the savings will drive adoption. Soon.

      BTW, for anybody interested in synthetic actor tech, Disney is in the lead but EPIC isn’t far behind:

      Parts of the intro are live video, parts are pre-rendered CGI, parts are on-the-fly XBOX SX graphics. And that’s consumer level hardware.
      An expert *might* be able to tell which is which.

  8. Check the voice in the post vs ten years ago. Human narrators will follow the monks with the beautiful hand. Consumers will be happy to pay less, and that will be a downward pressure on audio prices.

    A second pressure will be a large increase in audio titles due to the lower cost.

  9. Seems there will then be a special niche for “as read by author” books – though they might be pricey, the listener knows it’s not a bot, and the author really does know (or should) the difference between affect (noun) and affect (verb), effect (noun) and effect (verb), as well as the much simpler read and read.

    Even inexperienced narrators will get fewer of those wrong than AI, especially for fiction.

    • Very likely.
      Much like author-preferred editions or special leatherbound collector’s editions.
      But the economics of synthetic narration won’t be kind to any but the most popular of narrators, and those are likely to end up licensing their vocal profile.

        • Hollywood now has the ability to create digital actors who look and sound like dead actors. Once created, the digital model can be endlessly reused for different productions. As in the Matrix clip I linked above.

          Audio can likewise record the vocal parameters and style of a given narrator, say Jim Dale (of the Harry Potter books), and use them for a synthetic voice without human intervention. We already see it with the various digital assistants like Alexa, Siri, and Cortana. Cortana, for example, uses the voice of Jen Taylor, the iconic voice actress from 20 years of HALO. Alexa lets you swap the voice of Samuel Jackson for the regular Alexa voice, whose source Amazon, unlike Microsoft, doesn’t openly acknowledge. One report says the voice belongs to voiceover actress Nina Rolle:

          Whoever it is, the voice is computer generated according to the actress’ voice parameters. Current tech is pretty good but not as good as it can be. Down the road, the tech will likely be used to replicate the voice and style of classic singers, say Nat King Cole, for new songs they never sang.

          For now, pretty much all recognizable voices belong to living persons or their estates and must be licensed. But, as recently pointed out, early 20th century recordings are starting to fall into the public domain, which in coming decades will render some iconic voices fair game: say, Bing Crosby, the Andrews Sisters, etc. Imagine a duet of Faux-King Cole and Faux-Crosby singing an entirely new song.

          Again: the tech will soon be able to do it. The commercial incentive exists.
          It’ll happen.

      • My guess: actors are paid to memorize a play or scenes for a movie. Writers, my kind, anyway, can’t remember why they wrote all that stuff, and don’t have it memorized. Different skills.

        Creating a symphony and playing the viola part are different skills.

        I’ve been noticing how many times movies with a writer/actor/director are lacking in one of the essential skills.
