Where Is All the Book Data?

From Public Books:

Culture industries increasingly use our data to sell us their products. It’s time to use their data to study them. To that end, we created the Post45 Data Collective, an open access site that peer reviews and publishes literary and cultural data. This partnership between the Data Collective and Public Books, a series called Hacking the Culture Industries, brings you data-driven essays that change how we understand audiobooks, bestselling books, streaming music, video games, influential literary institutions such as the New York Times and the New Yorker, and more. Together, they show a new way of understanding how culture is made, and how we can make it better.

—Laura McGrath and Dan Sinykin

. . . .

After the first lockdown in March 2020, I went looking for book sales data. I’m a data scientist and a literary scholar, and I wanted to know what books people were turning to in the early days of the pandemic for comfort, distraction, hope, guidance. How many copies of Emily St. John Mandel’s pandemic novel Station Eleven were being sold in COVID-19 times compared to when the novel debuted in 2014? And what about Giovanni Boccaccio’s much older—14th-century—plague stories, The Decameron? Were people clinging to or fleeing from pandemic tales during peak coronavirus panic? You might think, as I naively did, that a researcher would be able to find out exactly how many copies of a book were sold in certain months or years. But you, like me, would be wrong.

I went looking for book sales data, only to find that most of it is proprietary and purposefully locked away. What I learned was that the single most influential data in the publishing industry—which, every day, determines book contracts and authors’ lives—is basically inaccessible to anyone beyond the industry. And I learned that this is a big problem.

The problem with book sales data may not, at first, be apparent. Every week, the New York Times of course releases its famous list of “bestselling” books, but this list does not include individual sales numbers. Moreover, select book sales figures are often reported to journalists—like the fact that Station Eleven has sold more than 1.5 million copies overall—and also shared through outlets like Publishers Weekly. However, the underlying source for all these sales figures is typically an exclusive subscription service called BookScan: the most granular, comprehensive, and influential book sales data in the industry (though it still has significant holes—more on that to come).

Since its launch in 2001, BookScan has grown in authority. All the major publishing houses now rely on BookScan data, as do many other publishing professionals and authors. But, as I found to my surprise, pretty much everybody else is explicitly banned from using BookScan data, including academics. The toxic combination of this data’s power in the industry and its secretive inaccessibility to those beyond the industry reveals a broader problem. If we want to understand the contemporary literary world, we need better book data. And we need this data to be free, open, and interoperable.

Fortunately, there are a number of forward-thinking people who are already leading the charge for open book data. The Seattle Public Library is one of the few libraries in the country that releases (anonymized) book checkout data online, enabling anyone to download it from the internet for free. It isn’t book sales data, but it’s close. And such data might help us understand how the popularity of certain books fluctuates over time and in response to historical events like the COVID-19 pandemic (especially if more libraries around the country join the open data effort). Literary scholars have also begun to compile “counterdata” about the publishing industry. Richard So, a professor of English and cultural analytics at McGill University, and Laura McGrath, an English professor at Temple University, have respectively collected data about the race and ethnicity of authors published by mainstream publishing houses. Through their work, So and McGrath each prove that the Big Five houses have historically been dominated by white authors and that they continue to systematically reinforce whiteness today.

While all of this data is powerful in its own right, it becomes even more powerful if we can combine it all together: if we can merge author demographic data with library checkout data or with other literary trends. This promise anchors the Post45 Data Collective, an open-access repository for literary and cultural data that was founded by McGrath and Emory professor Dan Sinykin, and that I now lead as a coeditor with Sinykin. One of the goals of the repository is to help researchers get credit for the data that they painstakingly collect, clean, and share. But a broader goal is to share free cultural data with anybody who wants to reuse and recombine it to better understand contemporary literature, music, art, and more.
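As a rough illustration of the kind of merge described above, the following Python sketch joins library checkout records to author records on a shared author name and then totals checkouts per group. Every record, field name, and function name here is a hypothetical placeholder invented for the example; none of it reflects real Seattle Public Library data, the Post45 Data Collective's actual schema, or anyone's research findings.

```python
from collections import defaultdict

# Hypothetical placeholder records; NOT real checkout or demographic data.
checkouts = [
    {"title": "Title A", "author": "Author X", "year": 2019, "checkouts": 10},
    {"title": "Title A", "author": "Author X", "year": 2020, "checkouts": 25},
    {"title": "Title B", "author": "Author Y", "year": 2020, "checkouts": 5},
    {"title": "Title C", "author": "Author Z", "year": 2020, "checkouts": 7},
]
author_info = [
    {"author": "Author X", "group": "group_1"},
    {"author": "Author Y", "group": "group_2"},
]


def merge_checkouts_with_authors(checkout_rows, author_rows, key="author"):
    """Left-join checkout rows to author rows on a shared key.

    Checkout rows with no matching author record are kept and
    labeled "unknown" rather than silently dropped.
    """
    lookup = {row[key]: row for row in author_rows}
    merged_rows = []
    for row in checkout_rows:
        match = lookup.get(row[key], {})
        merged_rows.append({**row, "group": match.get("group", "unknown")})
    return merged_rows


def total_by(rows, field, value_field="checkouts"):
    """Sum value_field across rows, grouped by field."""
    totals_by_field = defaultdict(int)
    for row in rows:
        totals_by_field[row[field]] += row[value_field]
    return dict(totals_by_field)


merged = merge_checkouts_with_authors(checkouts, author_info)
totals = total_by(merged, "group")
print(totals)
```

The join only works because both toy datasets share an identically spelled `author` field, which is one small reason interoperability matters: real datasets would more plausibly need a stable identifier such as an ISBN or an authority-file ID.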

. . . .

BookScan’s influence in the publishing world is clear and far-reaching. To an editor, BookScan numbers offer two crucial data points: (1) the sales history of the potential author, if it exists, and (2) the sales history of comparable, or “comp,” titles. These data points, if deemed unfavorable, can mean a book is dead in the water.

Take it from freelance editor Christina Boys, whom I spoke with over email, and who worked for 20 years as an editor at two of the Big Five publishing houses (Simon & Schuster and Hachette Book Group). Boys told me that BookScan data is “very important” for deciding whether to acquire or pass on a book; BookScan is also used to determine the size of an advance, to dictate the scale of a marketing campaign or book tour, and to help sell subsidiary rights like translation rights or book club rights. “A poor sales history on BookScan often results in an immediate pass,” Boys said.

Clayton Childress, a sociologist at the University of Toronto, came to similar conclusions in his 2012 study of BookScan data, in which he interviewed and observed more than 40 acquisition editors from across the country. Bad book sales numbers can haunt an author “like a bad credit score,” Childress reported, and they can “caus[e] others to be hesitant to do business with them because of past failures.”

According to editors like Boys, the sway of book sales figures has siphoned much of the creativity and originality out of contemporary book publishing. “There’s less opportunity to acquire or promote a book based on things like gut instinct, quality of the writing, uniqueness of an idea, or literary or societal merit,” Boys claimed. “While passion—arguing that a book should be published—still matters, using that as a justification when it’s contrary to BookScan data has become increasingly challenging.” In a similar vein, Anne Trubek, the founder and publisher of the independent press Belt Publishing, told me that BookScan data is a strong conservative force in the industry—one of the reasons, though not the only reason, that Belt Publishing stopped subscribing after only one year. Trubek says that BookScan data encourages publishers to keep recycling the same kinds of books that sold well in the past. “I didn’t want to be a publisher who was working that way,” she elaborated. “That was not interesting. I think a lot of Big Five publishing is driven by data, and I think that things end up much more unimaginative as a result.”

Despite these claims, other publishing professionals maintain that BookScan data has not changed their work quite as dramatically. Childress interviewed one editor who explained that he manages to use BookScan data in creative ways to support his own independent choices. Yet even when editors find inventive ways to use BookScan data and to preserve their own aesthetic judgment, it is striking that they must still use and reckon with BookScan data in some form.

Perhaps most importantly, however, it is likely that books end up much more racially homogenous—that is, white—as a result of BookScan data, too. For example, in McGrath’s pioneering research on “comp” titles (the books that agents and editors claim are “comparable” to a pitched book), she found that 96 percent of the most frequently used comps were written by white authors. Because one of the most important features of a good comp title is a promising sales history, it is likely that comp titles and BookScan data work together to reinforce conservative white hegemony in the industry.

. . . .

For all of its extensive influence, most of us outside the publishing industry know surprisingly little about BookScan data: how much it costs, what it looks like, or what exactly it includes and measures. According to a 2009 business study, publishing house licenses for BookScan data cost somewhere between $350,000 and $750,000 a year at that time. Literary agents, scouts, and other publishing professionals can subscribe to NPD Publishers Marketplace for the humbler baseline price of $2,500 a year, and many authors can view their own BookScan data for free via Amazon.

But academics and almost everyone else are out of luck. When I inquired about getting access to BookScan data directly through NPD Group (the market research company that bought US BookScan from Nielsen in 2017), a sales specialist told me: “There are some limitations to who we are permitted to license our BookScan data to. This includes publishers, retailers, book distributors, publishing arms of universities, university presses and author agents. Do you fall within one of these categories?” When I reached out to NPD Publishers Marketplace, they told me the same thing. David Walter, executive director of NPD Books, confirmed that NPD does not license data to academic researchers: “We only license to publishers and related businesses, and … our license terms preclude sharing of any data publicly, which conflicts with the need to publish academic research. That is why we do not license data for the purposes of academic research.”

Link to the rest at Public Books

PG notes that the OP continues to delve into the details and problems of excluded data in BookScan and he recommends reading the article in its entirety.

PG has written about BookScan on several prior occasions. BookScan is presently owned by Hellman & Friedman, a private equity investment firm headquartered in San Francisco with offices in New York City and London.

This structure means that BookScan’s activities and finances are watched carefully by a group of numbers guys and numbers gals (although all the big bosses appear to be guys).

A quick look at H&F’s portfolio companies reveals 77 present and past subsidiaries that are all over the board. Insurance, cloud computing, home décor products, home security, customer experience management, energy and metals research, etc., etc.

H&F’s description of the companies it formerly owned/invested in shows more than a few that the company purchased, rehabbed and resold.

While sampling H&F’s past and present portfolio companies, an old term floated into PG’s mind: Pump and Dump. Pump and Dump involves acquiring shares in a publicly-held company, then fraudulently inflating the price of that company’s stock and selling out while the share prices are high. Such activity is often followed by a decline in the price of the company’s shares.

Pump and Dump is illegal and PG is not suggesting that H&F’s activities with its past or present portfolio companies constitute an illegal Pump and Dump scheme.

However, the company does list former portfolio companies it acquired and later sold, presumably for a higher price, after increasing the health and value of the company.

The traditional publishing industry’s reliance on BookScan for a whole lot of decisions that impact authors is, as the OP implies, close to a religion.

If anyone in the traditional publishing business asked PG for his opinion regarding the tracking of book sales (pigs flying is more likely to occur), he would advise developing an analytics system that sliced and diced sales on Amazon in a large variety of ways.

While not ignoring BookScan completely, PG suspects that publishers would gain more actionable data from watching sales (and returns) of their ebook and print editions in close to real-time from the world’s largest bookstore instead of a collection of traditional retail outlets that have been losing market share in books for a very long time.

In PG’s monumentally humble opinion, those people who regularly purchase books from a physical bookstore are not representative of the book-buying public as a whole.

16 thoughts on “Where Is All the Book Data?”

  1. It’s sort of a weird stance, that incorrect and incomplete information should be available to everyone.

    Bookscan seems to pretend most indie publishing doesn’t exist, so it doesn’t really matter how accessible their data is.

  2. As a side note, I had to snort at the line about “conservative white hegemony.” White? Sure, although that’s called “the consequences of whites being the racial supermajority in America for centuries.”

    Conservative? Maybe by the standards of academia, but something like 90% of all publishing house employees vote Democrat, and you’re much more likely to find books that denigrate right-wing identity groups than left-wing ones.

  3. I snort very loudly at the “85% coverage” claim. It is possible that in three particular publishing industries (out of 13) — trade fiction, trade nonfiction, and trade children’s (shorthand, includes MG and YA too) — BookScan might reach 85%. On an extremely good day. For sales through brick-and-mortar outlets that categorize themselves as “bookstores.”

    That’s somewhere under 30% of “print publishing” (by revenue or by units) without considering ebooks.

  4. The author of the piece seems to want all BookScan data publicly available at no charge, but offered no guidance as to how the service would be paid for. I wouldn’t complain, but I’m assuming this sort of Santa Claus economics is not really going to work well.

  5. I don’t get it. What’s wrong with publishing books that people want to buy, as opposed to books that people don’t want to buy?

  6. 1. “Through their work, So and McGrath each prove that the Big Five houses have historically been dominated by white authors and that they continue to systematically reinforce whiteness today.”

    2. “A poor sales history on BookScan often results in an immediate pass,” Boys said.

    Does BookScan include author pigmentation?

  7. 1. OP: “many authors can view their own BookScan data for free via Amazon…”

    Does anyone know how to do this on Amazon? I find a lot of outdated ‘information’ like this, and when I go to Amazon to check it, it doesn’t look like the blog post author states it does – or it does not exist where it supposedly (formerly maybe?) is said to exist.

    Not very helpful, if you can only look at your own (is the BookScan data available for free for traditionally published authors, or can indies see their own data, too?), but it would be a good idea to record it annually or semi-annually for future reference.

    2. A propos of all this, does anyone know what happened to Author Earnings? Is it available to individuals for pay? Or did they take it corporate? It was interesting while it lasted, and then poof!

    • Most likely, Alicia, BookScan numbers are only available to real authors – not the likes of us. They aggregate from a network of brick and mortars, along with whatever online vendor pass through numbers they get from their “partners” – medium to large to massive publishers (mostly massive). Amazon, if they were to give us the button, would be reporting “no data” for the vast majority of the writers.

      As to Author Earnings – yes, they went corporate, and for big bucks, as Felix just noted. Which is not a “sell out,” by any means. While their analysis algorithms cannot be very complex, there is a large investment in maintaining big data, and in acquiring it – remember that it started out as a comparison between Amazon rankings and what hard data they could acquire from some writers to make an estimate of the correlation between ranking and sales. Amazon, as we know, doesn’t release any public granular sales numbers – and they don’t archive the changes for anyone, including the writers. It’s an expensive operation now that it’s past the proof of concept stage that we were seeing early on.

      • Ah, yes, that is the place; I stand corrected as to the reply to Alicia.

        However… “NPD BookScan estimates they report 85% of all retail print book sales” Definitely still very limited in utility for the majority of writers.

        Completely worthless, IMHO, unless you are trying to keep tabs on a book that is sold through tradpub.
