AI audiobooks take a big step towards the audio New Normal

From The New Publishing Standard:

Pretty much since smartphones became mainstream, audio content in the form of podcasts and audiobooks has been gathering momentum as a significant format sector in the global publishing industry.

Even with the à la carte and monthly credit subscription models, audio has taken off big time with consumers, while in markets where publishers are amenable to unlimited subscription, audiobooks have quickly become a format to rival – and in the case of Sweden even to exceed – the popularity of print.

But the brake on audio – and especially on longform audiobooks – has always been production costs: studios, sound engineers and narrators can add thousands of dollars to the cost of a book as a sound product, deterring many publishers and making some titles financially unviable.

Lurking in the background as the audio industry discovered and embraced digital was AI – artificial intelligence – with the futuristic promise and premise that one day an entire book could be narrated by a robot and no one would know any better.

Well, we’re not there yet, but anyone who follows developments in this arena will know quality is improving fast, driven by the proven global demand for digital audio based on text-to-speech (TTS).

As an author I love the idea that one day I might, at the click of a mouse, convert my novels into saleable-quality audiobooks, and as an industry commentator writing TNPS I fantasise about the day I might click that mouse and have my TNPS posts converted into podcasts.

In the real world it seemed the latter might happen soonest, as TTS appears to be developing fastest in the non-fiction arena, where delivery relies less on emotion and more on conveying information.

But the reality is that when I try the latest sample AI offerings I hit one major obstacle – TNPS posts are so full of “foreign” names (as in, not in the AI’s English names database) that the text converted to sound is quite unacceptable. Another couple of years and it might be a different story.
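For what it’s worth, most commercial TTS engines can already be nudged past exactly this obstacle with SSML, the W3C speech-markup standard, which lets you spell out a pronunciation name by name – though doing that across years of posts is hardly the one-click conversion I fantasise about. A rough, purely illustrative sketch using Microsoft’s Azure Speech SDK (the key, region, voice and IPA string below are placeholders, not a recommendation):

    import azure.cognitiveservices.speech as speechsdk

    # Placeholders: substitute a real Azure Speech key and region.
    config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)

    # An SSML <phoneme> tag overrides the engine's guess for a single
    # name; the IPA transcription here is illustrative, not authoritative.
    ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-GB'>
      <voice name='en-GB-RyanNeural'>
        In Swedish, Sweden is <phoneme alphabet='ipa' ph='ˈsværjɛ'>Sverige</phoneme>.
      </voice>
    </speak>"""

    result = synthesizer.speak_ssml_async(ssml).get()  # synthesized audio in result.audio_data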

But for fiction, where conveying emotion and tone has been the problem, progress has been palpable, culminating this week in news that one AI-audio operator, UK-based DeepZen, has partnered with US distributor Ingram to offer its AI-audio services to a no doubt cautiously optimistic publishing industry.

Per the DeepZen press release,

The service uses innovative technology that replicates the human voice to create a listening experience that is virtually indistinguishable from the real thing. Developed specifically for audiobooks and long form content, it incorporates artificial intelligence, natural language processing, and next generation algorithms.

DeepZen’s AI voices are licensed from voice actors and narrators, capturing all of the elements of the human voice, such as pacing and intonation, and a wide range of emotions that produce more realistic speech patterns. They are benchmarked against human narration, and are a world away from the robotic, monotone, voice assistants with which we are all familiar.

But that still raises the question: are they a world away enough to be acceptable to paying consumers?

The 49-second sample DeepZen offers via the press release really isn’t enough to make that call, but check it out here and see – or rather hear – for yourself.

Link to the rest at The New Publishing Standard

Here’s a link to DeepZen where you can hear some AI voices.

10 thoughts on “AI audiobooks take a big step towards the audio New Normal”

  1. VO actors are in a tizzy – some didn’t realize they were giving their employer a permanent license to use their voices however they saw fit, and are very unhappy about it. No royalties, no residuals, no nothing – and the digital repetition can be forever.

  2. Definitely getting better. On the selection link, Edward-Male-Fiction isn’t bad, whereas Todd-Male-Fiction is not good. But I’m wondering how they handle different characters (male & female) with the same voice in the same book. Has anyone here tried DeepZen yet?

  3. I think the DeepZen audio sample sounded like a totally generic computer-generated voice circa 2021. Sure, at some point in the future such AI voices will sound convincing, but I don’t see how this example is even a small step forwards.

    • Personally, I think it’s getting close. It only really has to be “good enough” for many to make the jump to it for audiobooks.

      I followed early print-on-demand (POD) tech from an image-quality point of view. And it was awful. But slowly, over time, it got better and better until, presto, it was good enough, and then it was everywhere. I can still criticize and nit-pick POD, but nobody else cares; it’s very much accepted now. I expect something similar with AI voices.

      • I expect to see a functional version of AI audiobooks emerge from Audible sometime in the next 5-10 years. It is inevitable, as sound is a very easy data source to digitize and quantize. The trick is teasing out inflections and tone from the text, and that is right in the wheelhouse for inference engines. The first uses of AI speech in game development are already out:

        https://www.unite.ai/game-developers-look-to-voice-ai-for-new-creative-opportunities/

        With gaming (tens of billion$$$) as a driver (*not* audiobooks), the tech is going to be evolving at the classic internet-time speed. I’m probably pessimistic in my estimates.

        • Here is an excellent overview of how the tech was used in developing the 2020 OUTER WORLDS RPG from OBSIDIAN.

          https://www.youtube.com/watch?v=YajBa5PO1Hk

          Of note, OBSIDIAN is now part of the Microsoft Game Studios lineup. Deep pockets and lots of in-house tech supporting them.
          (As in Microsoft’s own AI voice *product*.)
          Because gaming and audiobooks aren’t the only uses.
          (see next)

          • From Feb 2021:
            https://www.msn.com/en-us/money/other/heres-how-microsofts-azure-ai-creates-realistic-digital-voices/ar-BB1dnLF2

            “The real technology breakthrough is the efficient use of deep learning to process the text to make sure the prosody and pronunciation is accurate. The prosody is what the tone and duration of each phoneme should be. We combine those in a seamless way so they can reproduce the voice that sounds like the original person.

            If all of this sounds a bit familiar, you may have seen coverage of Microsoft’s patent for similar technology. The patent made the news because the technology described within it could be used to create chatbots of dead people.

            Microsoft is aware of the fact that technology like this could be used in creepy or dishonest ways, and it talks about transparency in its blog post. Access to the technology is limited and requires disclosure of how it will be used. Microsoft explains:

            A conversation with Bugs Bunny might feel real, but everyone knows that it isn’t – because Bugs is a fictional character. That’s an important distinction, and one that Microsoft is careful to protect in every application of the technology. That’s a key reason Custom Neural Voice is limited access, meaning interested customers must apply and be approved by Microsoft to use the technology. In this case, general availability means it is ready for production and available in more Azure cloud regions, not that it is available to the general public.

            While many uses for Custom Neural Voice involve a fictional character, sometimes a customer wants the voice to be a real person, such as an author reading their own book. Even in those cases, it is important that people know the voice is synthetic, which is why Microsoft includes a disclosure requirement in its contract. ”

            The need for such disclaimers, by itself, points to how good the tech is getting.
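            To put the prosody point in concrete terms: with the generally available Azure neural voices (the same Speech SDK surface, not the limited-access Custom Neural Voice described above), SSML markup drives exactly the rate, pitch, pauses and, on some voices, emotional style the quote describes. A rough sketch – key, region, voice and style values are placeholders:

                import azure.cognitiveservices.speech as speechsdk

                # Placeholders: substitute a real Azure Speech key and region.
                config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
                synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)

                # SSML exposes the prosody controls the quote describes:
                # speaking rate, pitch, pauses, and an emotional style.
                ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
                       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
                  <voice name='en-US-AriaNeural'>
                    <mstts:express-as style='cheerful'>
                      <prosody rate='-10%' pitch='+2st'>The day is coming.</prosody>
                      <break time='400ms'/>
                      Sooner than most of us think.
                    </mstts:express-as>
                  </voice>
                </speak>"""

                result = synthesizer.speak_ssml_async(ssml).get()  # audio in result.audio_data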

            Bottom line: the day is coming. Like all new tech, commercial use won’t be cheap initially, but it will be quickly “democratized”.

          • It won’t.
            As the Obsidian video makes clear, the current commercial systems either require human oversight or are narrowly focused (the AZURE system, which is emotion-free – just fluid conversation). For an audiobook converter, a lot of work remains to be done on the inference engines.
            But hard doesn’t mean impossible.

            Up front, I would expect third-person stories to work better early on than first-person, and more cerebral stories better than emotion-based ones. Meaning is easier to tease out than emotion, which is, after all, something humans get wrong all the time, on both sides.

            • Emotion is hard, and subtlety in speech – from tone, speed, or strategic pauses – can make a big difference. (Tara Strong’s Miss Minutes in the Disney LOKI series is a master class in voice acting.)

              Still, that kind of skill is rare and not required for audiobooks, whether non-fiction or fiction. In particular, genre audiobooks aren’t necessarily about dramatizing the story but about drawing the listener in. Audiobooks aren’t audioplays, though the trend toward those is growing.

              The video game world is the opposite, and skilled voice actors can make a game go, ahem, “Legendary”. 😀 AI voices may suffice for routine characters, but main characters should remain human performances for a while longer than in most audiobooks.
