From Wired:
Jodie Archer had always been puzzled by the success of The Da Vinci Code. She’d worked for Penguin UK in the mid-2000s, when Dan Brown’s thriller had become a massive hit, and knew there was no way marketing alone would have led to 80 million copies sold. So what was it, then? Something magical about the words that Brown had strung together? Dumb luck? The questions stuck with her even after she left Penguin in 2007 to get a PhD in English at Stanford. There she met Matthew L. Jockers, a cofounder of the Stanford Literary Lab, whose work in text analysis had convinced him that computers could peer into books in a way that people never could.
Soon the two of them went to work on the “bestseller” problem: How could you know which books would be blockbusters and which would flop, and why? Over four years, Archer and Jockers fed 5,000 fiction titles published over the last 30 years into computers and trained them to “read”—to determine where sentences begin and end, to identify parts of speech, to map out plots. They then used so-called machine classification algorithms to isolate the features most common in bestsellers.
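The pipeline the article describes — parse texts into surface features, then train a classifier to separate bestsellers from the rest — might be sketched roughly like this. This is a stdlib-only toy: the features are made up to echo ones the article mentions (contractions, the verb "need," exclamation marks), and the nearest-centroid rule is a stand-in for whatever classification algorithms Archer and Jockers actually used, which the article doesn't specify.

```python
from collections import Counter
import re

# Toy stand-ins for the kind of surface features the article describes:
# contraction rate, frequency of the verb "need", exclamation-mark rate.
def extract_features(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(words), 1)
    counts = Counter(words)
    return {
        "contraction_rate": sum(v for w, v in counts.items() if "'" in w) / n,
        "need_rate": counts["need"] / n,
        "exclaim_rate": text.count("!") / n,
    }

# A nearest-centroid "classifier": average the feature vectors of each
# class, then label a new text by whichever class average it sits closer to.
def centroid(feature_dicts):
    keys = feature_dicts[0].keys()
    return {k: sum(d[k] for d in feature_dicts) / len(feature_dicts) for k in keys}

def classify(text, best_centroid, other_centroid):
    f = extract_features(text)
    def dist(c):
        return sum((f[k] - c[k]) ** 2 for k in c)
    return "bestseller" if dist(best_centroid) < dist(other_centroid) else "other"
```

The real project evidently extracted thousands of such features (2,799 of them, per the article) rather than three, but the shape of the computation is the same: texts in, feature vectors out, classifier on top.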
The result of their work—detailed in The Bestseller Code, out this month—is an algorithm built to predict, with 80 percent accuracy, which novels will become mega-bestsellers. What does it like? Young, strong heroines who are also misfits (the type found in *The Girl on the Train*, *Gone Girl*, and *The Girl with the Dragon Tattoo*). No sex, just “human closeness.” Frequent use of the verb “need.” Lots of contractions. Not a lot of exclamation marks. Dogs, yes; cats, meh. In all, the “bestseller-ometer” has identified 2,799 features strongly associated with bestsellers.
What Archer and Jockers have done is just one part of a larger movement in the publishing industry to replace gut instinct and wishful thinking with data. A handful of startups in the US and abroad claim to have created their own algorithms or other data-driven approaches that can help them pick novels and nonfiction topics that readers will love, as well as understand which books work for which audiences. Meanwhile, traditional publishers are doing their own experiments: Simon & Schuster hired its first data scientist last year; in May, Macmillan Publishers acquired the digital book publishing platform Pronoun, in part for its data and analytics capabilities.
While these efforts could bring more profit to an oft-struggling industry, the effect for readers is unclear.
“Part of the beautiful thing about books, unlike refrigerators or something, is that sometimes you pick up a book that you don’t know,” says Katherine Flynn, a partner at Boston-based literary agency Kneerim & Williams. “You get exposed to things you wouldn’t have necessarily thought you liked. You thought you liked tennis, but you can read a book about basketball. It’s sad to think that data could narrow our tastes and possibilities.”
They Know What You Did Last Night
Once, publishers had to rely on unit sales to figure out what readers wanted. Digital reading changed that. Publishers can know that you raced through a novel to the end, or that you abandoned it after 20 pages. They can know where and when you’re reading. On some reading sites and apps, users sign in with their Facebook accounts, opening up more personal data. There’s a wrinkle, though: Companies such as Amazon and Apple have the data for books read on their devices, and they aren’t sharing it with publishers.
London-based startup Jellybooks offers a workaround. Publishers can hire Jellybooks to conduct virtual focus groups, giving readers free ebooks, often in advance of publication, in exchange for their sharing data on how much, when, and where they read. JavaScript is embedded in the books, and at the end of each chapter, readers are asked to click a link that sends the data to Jellybooks. In almost two years, the company has run tests for publishers in the US, England, and Germany, and uncovered one sobering fact: Most novels are abandoned before readers are halfway through them. Jellybooks’s findings can guide publishers on their marketing, and even whether it’s worth signing an author again. “Hollywood moguls might do test screenings for movies to decide on how much [marketing] budget a movie should get,” says Andrew Rhomberg, the founder of Jellybooks. “That was never done for books.”
The ability to know who reads what and how fast is also driving Berlin-based startup Inkitt. Founded by Ali Albazaz, who started coding at age 10, the English-language website invites writers to post their novels for all to see. Inkitt’s algorithms examine reading patterns and engagement levels. For the best performers, Inkitt offers to act as literary agent, pitching the works to traditional publishers and keeping the standard 15 percent commission if a deal results. The site went public in January 2015 and now has 80,000 stories and more than half a million readers around the world.
Albazaz, now 26, sees himself as democratizing the publishing world. “We never, ever, ever judge the books. That’s not our job. We check that the formatting is correct, the grammar is in place, we make sure that the cover is not pixelated,” he says. “Who are we to judge if the plot is good? That’s the job of the market. That’s the job of the readers.”
. . . .
The Data Scare
As Archer and Jockers shopped *The Bestseller Code* manuscript to acquisitions editors, word of their powerful algorithm spread—as did worry and suspicion among those in the publishing profession. “The fear is we can homogenize the market or try and somehow take their jobs away from them, and the answer is no and no,” says Archer. “What the bestseller-ometer is trying to do is say, ‘Hey, pick this new author that you might not dare take a risk on with your acquisitions budget. Their chance is really good.’” Archer, now a writer in Boulder, Colorado, insists that she and Jockers, now an English professor at the University of Nebraska-Lincoln, are “literature-friendly” and want good books to succeed.
Andrew Weber, the global chief operating officer for Macmillan Publishers—whose St. Martin’s Press is publishing *The Bestseller Code*—thinks algorithms should be viewed as an additional piece of information, rather than as an excuse to fire the editors. “Whether it’s in acquisition, whether it’s in pricing, whether it’s in marketing, whether it’s in distribution, there just seem to be many, many, many opportunities to improve the quality of our decision-making—and therefore hopefully our results—by bringing data into the equation,” says Weber. “I would say we are still in the early days of that journey, but that’s the direction we’re headed.”
Archer and Jockers watched eagerly to see which novel would be their algorithm’s favorite. It turned out to be The Circle, a 2013 technothriller by Dave Eggers about working for a massively powerful Internet company. The Circle spent multiple weeks on both The New York Times hardcover fiction and paperback trade fiction bestseller lists. A movie version starring Emma Watson and Tom Hanks is expected in theaters this year.
Link to the rest at Wired
It appears that PG missed this when it first appeared in 2016.
He suspects the almost-universal phobia towards computers, algorithms, quantitative analysis, sophisticated metrics, etc., among the indwellers of traditional publishing is related to the widespread incidence of innumeracy among English majors.
Worship of The Golden Gut is the state religion of this group. For them, no collection of numbers and formulae can ever replace The Hunch. That’s one reason why so many books fail to earn out their advances, and why so many mega-sellers are first rejected by every major publisher before stumbling into the market and finding success.
Indie authors include a much wider slice of humanity than either publishers or traditionally-published authors. That diversity of talent and background, combined with Amazon’s relentless pursuit of customers and, thus, of numbers, analytics, categories, sub-categories, and sub-sub-categories, fosters the creation of niches within niches all the way down to the micro-reader level.
PG just checked a random book on the Zon and discovered that it encouraged drill-down and discovery as follows:
* Books
* Mystery, Thriller & Suspense
* Thrillers & Suspense
* Suspense
(PG is not certain how much of this collection of information is presented as a result of PG’s and Mrs. PG’s past buying habits.)
Finally, if you prefer, you could check out 383 different categories, series, spinoffs, heroes/heroines, etc., etc., etc. (including 盗墓笔记, El cementerio de los libros, Svartåsen, and Die Krimi-Serie in den Zwanzigern).
It’s a lot worse than PG says. Publishing executives aren’t precisely innumerate — they can plug numbers into spreadsheets to make decisions for them all day long. The problem is that the entertainment industry is anti-science, perhaps the epitome of C.P. Snow’s “Two Cultures” problem: Because the subject of the entertainment industry is “art,” gathering and evaluating actual data gets no attention whatsoever.
Consider the most obvious flaw in the Archer and Jockers project (other, that is, than that it is extolled as cutting-edge in Wired, which is almost always a tinfoil-hat warning): It was conducted as if each work stood alone, not just apart from any series it belonged to, but outside its time and its immediate social context. Consider the rise of badly-limned followers-of-Mohammed “bad guys” starting in the late 1990s and accelerating rapidly afterward, compared to followers-of-Stalin “bad guys” in the same period. And so on.
And as flawed as that project is, it’s vastly more searching than anything done in the publishing industry. For example, there’s a meme that trade books with predominantly green covers don’t sell. It was out of date when it was thrown at me in the 1990s: It was based on a combination of lighting characteristics and ink chemistries from the early 1960s that existed practically nowhere by 1990. Similarly for “embossed lettering sells books” (how does that work on Amazon, BTW — let alone for e-books, when the metallic shades most common in embossed lettering get distorted by 56 different types of displays?). But I’ve seen both of these memes presented as absolute, irrefutable fact in the last two years.
The real problem is that management doesn’t want to know anything that might require it to make expensive changes to its existing system.
Because the subject of the entertainment industry is “art,” gathering and evaluating actual data gets no attention whatsoever.
TV ratings do seem to get a passing glance.
… which assumes:
* That TV ratings are meaningful as to actual perception of the programming (example: how much of Super Bowl ratings are for the commercials?)
* That the means of gathering TV ratings bear some relationship to reality
* That the time over which TV ratings are gathered for different programs allows comparison (is “same day plus three” comparable for football games, weekday soap operas, Judge Judy, Survivor, and the final episode of M*A*S*H?)
* That “ratings” directly correlate to “enthusiasm”

Remember, the ratings are not used to establish anything except “rate chargeable to advertisers”; they don’t actually reflect revenue (there is more than one program currently airing with higher ratings that can’t ordinarily sell all of its ad time at the full rate, for example).
In short, “tracking stuff to which a number can be attached” is not necessarily gathering valid data.
… which assumes:
It assumes nothing.
It’s an observation that ratings are actual data that gets significant attention from the entertainment industry. And, as you say, that data reflects what advertisers will pay. Price is one of the important decision variables in a for-profit operation.
And is entertainment art? I don’t know. Who cares? Not consumers, not advertisers, and not the entertainment industry. Maybe artists care.
The people who interpret the data do, in fact, assume every one of the things that CE mentioned and you so airily dismissed. Raw data is of very little use unless you understand what you are measuring, why it is (or isn’t) important, and what its limitations are.
The claim was, “gathering and evaluating actual data gets no attention whatsoever.”
If one doesn’t understand any of the things I mentioned, one is not paying any attention whatsoever to evaluating the data. I suppose you could quibble about whether ‘gathering and evaluating’ is to be construed conjunctively or disjunctively.
I’m too dumb to know what that means.
Then you might not be the person best equipped to assess the quality of someone else’s data analysis.
Equipped conjunctively or disjunctively?
The dictionary is your friend.
The dictionary is your friend.
I’m still too dumb to know what that means.
At least you admit it.
Uh, ratings *aren’t* actual data.
They are algorithmically processed incomplete data.
Instead of actual viewers, they are a weighted aggregate of live viewers, DVR viewers, and first-week streaming viewers, with each category getting a different multiplier.
That is why the highest ratings go to sports and why advertisers pay more for them: they aren’t DVR-able, and viewers can’t pause and return a day later.
Nielsen doesn’t even collect full data. They estimate viewership by varying means, some of which are self-reported.
It is simply a standardized metric that ranks programs’ estimated mass appeal. It doesn’t even come close to measuring profitability, since many low-rated shows are gold mines for their producers while higher-rated productions are money losers, which means the metric isn’t even good at what it is really supposed to measure. Think of saddle shows: half-hour shows slotted between two popular shows. They have good ratings while carried by the saddle, but tank when set as the lead show in a block because the show lacks intrinsic appeal. More often than not, watching it is simply preferred to watching the second half hour of a one-hour show.
So no, ratings aren’t real data and the TV world knows it.
They just have nothing better.
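The weighting the commenter describes can be illustrated with a toy calculation. The category multipliers below are invented purely for illustration; the actual weights and categories used by ratings services are more involved and not public in any simple form.

```python
# Hypothetical per-category multipliers -- invented for illustration,
# not the weights any ratings service actually uses.
WEIGHTS = {"live": 1.0, "dvr": 0.5, "streaming_week1": 0.25}

def weighted_rating(viewers):
    """Combine per-category viewer estimates into one weighted figure."""
    return sum(WEIGHTS[cat] * count for cat, count in viewers.items())

# Two hypothetical programs with the SAME total audience (10.3M each):
# a live-heavy one (think sports) and a heavily time-shifted drama.
sports = {"live": 10_000_000, "dvr": 200_000, "streaming_week1": 100_000}
drama = {"live": 4_000_000, "dvr": 5_000_000, "streaming_week1": 1_300_000}
```

Under any weighting that discounts time-shifted viewing, the live-heavy program scores higher despite the identical total audience, which is one mechanical reason sports command premium ad rates.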
I agree ratings data don’t show profitability. Who but the conjunctively disjunctive think they come close? They don’t consider costs.
If they used a training set of 5,000 books, how did the algorithm perform when faced with a test of another 5,000 books it had never seen?
I’d also ask why they chose to publish rather than shop their system to publishers.
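The first question above is about held-out evaluation: an accuracy figure measured on the same 5,000 books the model was trained on would be close to meaningless. A minimal sketch of the standard protocol, with toy data and a deliberately dumb always-predict-"hit" model, just to show the split-then-score shape (the article doesn't say how Archer and Jockers validated their 80 percent figure):

```python
import random

def train_test_split(items, test_fraction=0.5, seed=42):
    """Shuffle and split so the model never sees the test items in training."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def evaluate(predict, labeled_items):
    """Fraction of held-out items the model labels correctly."""
    correct = sum(1 for features, label in labeled_items if predict(features) == label)
    return correct / len(labeled_items)
```

The point is the discipline, not the code: whatever classifier you train, its advertised accuracy should come from `evaluate` run on the held-out half, never on the training half.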