Avoiding Common Ebook Errors

4 November 2011

From Vook, an ebook publishing platform:

As more readers buy eBooks, formatting and typographical errors are becoming a major frustration. In recent days, Walter Isaacson’s Steve Jobs was removed from the iBookstore and replaced with a new version because of formatting errors. One iBookstore reviewer wrote: “I want my money back. The formatting errors in the iBooks version are appalling. At first, a caption is missing or just a word, but it soon becomes illegible. The publisher should be ashamed.”

Similarly, the Amazon Kindle release of Neal Stephenson’s eBook Reamde recently made headlines because line breaks, missing passages, and hyphens preceding words such as “people” and “couple” were scattered throughout the eBook. One Amazon reviewer wrote: “…the reading experience is fatally tainted,” and demanded a full refund of the $16.99 price.

. . . .

The frequency and types of errors depend on how the eBook is produced. The most error-prone eBooks emerge from a process called optical character recognition (OCR), in which the print version of the title is scanned page-by-page. Backlist titles are often converted into eBooks this way because publishers do not have digital versions of printed books.

“There simply aren’t digital files for many books—even recent books,” said Pablo Defendini, Interactive Producer at Open Road Integrated Media. “Nowadays, books are converted from PDF, from very old versions of Quark, or if those files are corrupted or are difficult to open, the physical book is scanned.” Scanning books through OCR is imperfect and introduces errors. “It is a machine. It will continue to read a letter ‘r’ next to a letter ‘n’ as the letter ‘m’,” Defendini said.
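To make the ‘rn’ versus ‘m’ problem concrete, here is a minimal Python sketch of how such errors could be flagged for a human proofreader. It is an illustration only, not a method described by Vook or Open Road. The tiny word list and the function names are hypothetical; a real check would load a full dictionary and more confusion pairs.

```python
import re

# Hypothetical mini word list; a real check would load a full dictionary.
KNOWN_WORDS = {"modern", "corner", "burn", "morning", "the", "at", "of", "lamp"}

# (misread, intended) pairs. Only "m" for "rn" comes from the article;
# any further pairs would be assumptions.
CONFUSIONS = [("m", "rn")]


def undo_confusion(word, misread, intended):
    """Yield each variant of word with one occurrence of misread replaced."""
    start = word.find(misread)
    while start != -1:
        yield word[:start] + intended + word[start + len(misread):]
        start = word.find(misread, start + 1)


def suggest_ocr_fixes(text, known_words=KNOWN_WORDS):
    """Return (word, suggestion) pairs for unknown words that become
    known words once a single OCR confusion is undone."""
    suggestions = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known_words:
            continue  # already a known word, nothing to flag
        for misread, intended in CONFUSIONS:
            match = next(
                (c for c in undo_confusion(lower, misread, intended)
                 if c in known_words),
                None,
            )
            if match:
                suggestions.append((word, match))
                break
    return suggestions


print(suggest_ocr_fixes("The modem lamp at the comer"))
```

Note that “modem” is a real word in its own right, so an ordinary spell check would pass it. The sketch only suggests candidates; a person still has to decide which ones are actual scanning errors.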

. . . .

Problem #1: Headers with Hyphens

The Problem: Unsightly hyphens in headers look awful. InDesign doesn’t currently export hyphenation settings to EPUB and the narrow screens of eReaders typically break header text onto separate lines, complete with hyphens.

. . . .

Problem #2: Chapter titles, Headers, and Sub-Headers Separated on Different eBook Pages

The Problem: EPUBs are reflowable, and chapter titles, headers, and sub-headers can be split from the text that immediately follows. The result? An ugly EPUB that reads more like a webpage and less like a book.

. . . .

Problem #3: Unsightly Indentation and Random Blank Pages in the eBook

The Problem: Sloppy indentation can make an eBook difficult to read. Blank pages make the reader wonder if they’ve missed a key passage. These formatting errors are often caused when converting EPUBs from Microsoft Word. In that scenario, the user will use the “Tab” key to indent paragraphs, and the “Enter” key to insert line breaks, rather than using Word’s built-in formatting styles.
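As a rough illustration of how this kind of hand-made formatting could be spotted before conversion, here is a short Python sketch. It assumes the manuscript has been saved from Word as plain text with one paragraph per line; the function name and the issue labels are hypothetical, and this is not the fix Vook proposes in its guide.

```python
def find_manual_formatting(text):
    """Return (line_number, issue) tuples for formatting done by hand."""
    issues = []
    blank_run = 0
    for number, line in enumerate(text.split("\n"), start=1):
        if line.strip() == "":
            blank_run += 1
            # A second empty paragraph in a row suggests Enter was used
            # for spacing; report each run of blanks only once.
            if blank_run == 2:
                issues.append((number, "repeated blank lines"))
            continue
        blank_run = 0
        if line.startswith("\t"):
            issues.append((number, "tab indent"))
        elif line.startswith("  "):
            issues.append((number, "space indent"))
    return issues


sample = "\tFirst paragraph.\nSecond paragraph.\n\n\n\nThird.\n  Fourth."
for line_number, issue in find_manual_formatting(sample):
    print(line_number, issue)
```

Each flagged line is a place where a paragraph style (first-line indent, space before or after, page break) would likely survive conversion better than a keystroke.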

Link to the rest at Vook.

If that link doesn’t work, go here. You’ll have to give Vook an email address so they can contact you about an upcoming beta, but you can say no when they contact you if you’re not interested.

This is a 13-page PDF with lots of good information whether you format your ebooks yourself or are checking the work of someone who formats them for you.


16 Comments to “Avoiding Common Ebook Errors”

  1. To be honest, the first time I heard about a publisher allowing raw OCR to be used as the basis for an eBook, I was shocked and wondered how that could possibly be the case. Isn’t it common knowledge in the business world that OCR is imperfect and MUST be proofread and corrected? I can understand an individual author being that uninformed, but a business?

    There are only two possible conclusions to draw: either the company does not care about the end product or they are hopelessly clueless. At least you can fix the latter.

    Let’s hope they now understand and won’t be making that mistake anymore.

  2. What I find pitiful is that most of the errors can be spotted by running a spell check. (Not corrected, spotted.)

    Spell check will show you the type of errors in that document and give you a way to quickly find the gibberish.

    It seems silly their IT people aren’t aware of the issues.

  3. I get that older books may need to be OCR’d into ebooks, but THEN you put a proofreader on it.

    There are half a dozen common OCR errors that someone who’s looking for them can find. Begin by noticing what errors you can see easily, do a search and replace run, then catch the rest by putting human eyeballs on the job.

    It ain’t rocket science, publishers.

  4. I think it’s great that Vook has drawn attention to what e-book readers already know. A lot of e-books from traditional publishers—the supposed “guardians of the book” who are supposedly protecting us from the “tsunami of indie crap”—are created with a complete lack of concern for the books or their readers. This, to me, is the worst way to build relationships with your core group of readers. Convince them to buy the book, then stick a thumb in their eye.

    Also, curious that your link goes around the Vook opt in, seems like a tech error on their part.

  5. It’s not the older books converted to e-books that surprise me, it’s the newer ones, books written in the last decade that (most likely) exist in a word processor format. I’m always boggled when I run into a bestseller released in the last two years whose e-book has runaway hyphens, em-dashes, line breaks and extra spaces all in the strangest places.

  6. Some of the things you mention are quite likely artifacts of the software used to convert the author’s e-file to print format. Again, human eyes are the magical ingredient, and judging by the numbers of big pub books I’ve seen with these artifacts, human eyes are what they’re not ready to pay for.

  7. I did OCR one of my backlist books. It was so much work/annoyance, it would have been quicker to simply type it from the book open in my lap. Every line had to be checked against the hardcover for scanning errors and there were a ton. Fortunately, no one has complained so I must have caught all of them but it made me vow never to do it again.

  8. One variable is the clarity of the type and the quality of the paper. Mass-market paperbacks come out with a lot of OCR errors compared to hardcovers on quality white paper. Small type is also more problem-prone than larger type.

  9. I’m sure the big publishing houses are no different from the rest of the corporate world; too few employees doing too much work with a few overpaid dimwits at the top making horrible decisions. They probably had a meeting with lots of colorful pie charts and free pastries where someone said:
    “These old backlists have already been edited before we published them ten years ago. No need to do it a second time. Let’s just scan em and get em listed.”

    “Is that all we need to do?”

    “Sure, trust me. If there’s one thing I know, it’s techie stuff.”

    “Weren’t you selling off-shore drilling equipment before we hired you?”

    “Doesn’t get more techie than that now does it? Anyone care if I take the last cruller?”

    Anyone think the poor guy told to scan twenty books a day is going to actually spell check them? He didn’t even get a free pastry so you know he’s disgruntled.

    Contrast that with the Indie who knows carelessness will only hurt his own sales. Who knows, maybe we will see a day where the big firms look more like sweatshops turning out cheap garbage and the indies (some of them at least) are seen as craftsmen (craftspeople?) turning out quality work.

    How long before the major writers at the big stables realize that the eBook market has surged ahead of print and decide to go it alone? It’s going to happen, especially under the current eRoyalty rates the big firms offer.

  10. So much for the argument that publishers are a guarantee of quality.

  11. There are lots of OCR-type errors popping up in backlist books – that’s clear.

    But for recent releases, we have seen high-profile errors in books from Pratchett, Stephenson, and lately, the Steve Jobs biography.

    It’s annoying that the media focus has been on the vendor (iBooks/Amazon) or the author rather than the publisher. The buck really stops with the publisher here.

    Obviously, as these are new releases, we aren’t talking about OCR errors. My guess is that there is some flaw in some automated conversion tool they are using. I think the general process in most publishers these days involves something like exporting an EPUB from InDesign or whatever, which is then, in turn, converted to MOBI for Amazon.

    Clearly, their conversion tool/process is very flawed. Personally, I wouldn’t trust these tools (or that process, to be honest).

    I hand-code the HTML myself for all my releases. I learned how to do it from Guido Henkel’s excellent (free) online guide. It took a couple of days to figure out for my first release. By the time of my third release, it only took maybe three hours, and that was a non-fiction title (much more complicated) with lots of things like headings, hyperlinks, etc. which take the most time.

    Publishers are essentially saying that we can’t devote two or three man hours per title to get our formatting right. It’s something they only have to do once, and it really shows contempt for readers, especially when they are charging up to $14.99 for something.

  12. “There simply aren’t digital files for many books—even recent books,”

    OF COURSE THERE ARE. Please.

    I suggest a read of Kris Rusch’s recent rant about lousy Journalism these days. ALL current novels and going back for more than a decade were done by digital files. A competent journalist would have checked this statement. *eye roll*

    And then the crappy journalism is picked up and repeated. Lord, give me patience.

    David Gaughran is absolutely right that the lousy formatting when changing digital formats (which is what is involved) is nothing but utter contempt for their customers.

    • You’re right, JR.

      In addition to publishers’ files, every author I know keeps word processing files of their books almost forever.

      Even if the editor didn’t send an as-printed or almost as-printed word-processing file to the author, scanning a book and using Word’s Document Compare on the OCR file and the last word processing file is much faster and less error-prone than scanning alone.

  13. Even if you use the paragraph indent set in “style”, Kindle doesn’t pick it up. We were just talking about the amazing things you can do with a computer but they can’t retain that code. I’m actually kind of relieved that big names have ebook problems as I always blame myself.

  14. Thus why it’s much better to have a human being on the other end rather than a machine….

  15. Here’s a recent article from the HuffPo about errors in e-books:

    http://huff.to/nt5ZXT
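
One idea raised in the comments above, comparing the OCR output against the author's last word-processing file, can be sketched in a few lines of Python. This is not Word's Document Compare; it is a stand-in that uses `difflib` from Python's standard library, and the function name and sample sentences are made up for illustration.

```python
import difflib


def ocr_differences(reference_text, ocr_text):
    """Return (reference_words, ocr_words) pairs where the texts disagree."""
    reference = reference_text.split()
    scanned = ocr_text.split()
    # Compare word by word; autojunk is disabled so common words are
    # never discarded as noise on long texts.
    matcher = difflib.SequenceMatcher(a=reference, b=scanned, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(reference[i1:i2]), " ".join(scanned[j1:j2]))
            )
    return differences


author_file = "She turned the corner and saw the modern house"
ocr_scan = "She tumed the comer and saw the modem house"
for expected, found in ocr_differences(author_file, ocr_scan):
    print(f"{expected!r} was scanned as {found!r}")
```

Every pair it reports still needs a human decision, since the word-processing file may itself differ from the printed text after late edits.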
