Algorithms Could Save Book Publishing—But Ruin Novels

From Wired:

Jodie Archer had always been puzzled by the success of The Da Vinci Code. She’d worked for Penguin UK in the mid-2000s, when Dan Brown’s thriller had become a massive hit, and knew there was no way marketing alone would have led to 80 million copies sold. So what was it, then? Something magical about the words that Brown had strung together? Dumb luck? The questions stuck with her even after she left Penguin in 2007 to get a PhD in English at Stanford. There she met Matthew L. Jockers, a cofounder of the Stanford Literary Lab, whose work in text analysis had convinced him that computers could peer into books in a way that people never could.

Soon the two of them went to work on the “bestseller” problem: How could you know which books would be blockbusters and which would flop, and why? Over four years, Archer and Jockers fed 5,000 fiction titles published over the last 30 years into computers and trained them to “read”—to determine where sentences begin and end, to identify parts of speech, to map out plots. They then used so-called machine classification algorithms to isolate the features most common in bestsellers.

The result of their work—detailed in The Bestseller Code, out this month—is an algorithm built to predict, with 80 percent accuracy, which novels will become mega-bestsellers. What does it like? Young, strong heroines who are also misfits (the type found in *The Girl on the Train*, *Gone Girl*, and *The Girl with the Dragon Tattoo*). No sex, just “human closeness.” Frequent use of the verb “need.” Lots of contractions. Not a lot of exclamation marks. Dogs, yes; cats, meh. In all, the “bestseller-ometer” has identified 2,799 features strongly associated with bestsellers.

What Archer and Jockers have done is just one part of a larger movement in the publishing industry to replace gut instinct and wishful thinking with data. A handful of startups in the US and abroad claim to have created their own algorithms or other data-driven approaches that can help them pick novels and nonfiction topics that readers will love, as well as understand which books work for which audiences. Meanwhile, traditional publishers are doing their own experiments: Simon & Schuster hired its first data scientist last year; in May, Macmillan Publishers acquired the digital book publishing platform Pronoun, in part for its data and analytics capabilities.

While these efforts could bring more profit to an oft-struggling industry, the effect for readers is unclear.

“Part of the beautiful thing about books, unlike refrigerators or something, is that sometimes you pick up a book that you don’t know,” says Katherine Flynn, a partner at Boston-based literary agency Kneerim & Williams. “You get exposed to things you wouldn’t have necessarily thought you liked. You thought you liked tennis, but you can read a book about basketball. It’s sad to think that data could narrow our tastes and possibilities.”

They Know What You Did Last Night

Once, publishers had to rely on unit sales to figure out what readers wanted. Digital reading changed that. Publishers can know that you raced through a novel to the end, or that you abandoned it after 20 pages. They can know where and when you’re reading. On some reading sites and apps, users sign in with their Facebook accounts, opening up more personal data. There’s a wrinkle, though: Companies such as Amazon and Apple have the data for books read on their devices, and they aren’t sharing it with publishers.

London-based startup Jellybooks offers a workaround. Publishers can hire Jellybooks to conduct virtual focus groups, giving readers free ebooks, often in advance of publication, in exchange for their sharing data on how much, when, and where they read. JavaScript is embedded in the books, and at the end of each chapter, readers are asked to click a link that sends the data to Jellybooks. In almost two years, the company has run tests for publishers in the US, England, and Germany, and uncovered one sobering fact: Most novels are abandoned before readers are halfway through them. Jellybooks’s findings can guide publishers on their marketing, and even whether it’s worth signing an author again. “Hollywood moguls might do test screenings for movies to decide on how much [marketing] budget a movie should get,” says Andrew Rhomberg, the founder of Jellybooks. “That was never done for books.”

The ability to know who reads what and how fast is also driving Berlin-based startup Inkitt. Founded by Ali Albazaz, who started coding at age 10, the English-language website invites writers to post their novels for all to see. Inkitt’s algorithms examine reading patterns and engagement levels. For the best performers, Inkitt offers to act as literary agent, pitching the works to traditional publishers and keeping the standard 15 percent commission if a deal results. The site went public in January 2015 and now has 80,000 stories and more than half a million readers around the world.

Albazaz, now 26, sees himself as democratizing the publishing world. “We never, ever, ever judge the books. That’s not our job. We check that the formatting is correct, the grammar is in place, we make sure that the cover is not pixelated,” he says. “Who are we to judge if the plot is good? That’s the job of the market. That’s the job of the readers.”

. . . .

The Data Scare

As Archer and Jockers shopped the *Bestseller Code* manuscript to acquisitions editors, word of their powerful algorithm spread—as did worry and suspicion among those in the publishing profession. “The fear is we can homogenize the market or try and somehow take their jobs away from them, and the answer is no and no,” says Archer. “What the bestseller-ometer is trying to do is say, ‘Hey, pick this new author that you might not dare take a risk on with your acquisitions budget. Their chance is really good.’” Archer, now a writer in Boulder, Colorado, insists that she and Jockers, now an English professor at the University of Nebraska-Lincoln, are “literature-friendly” and want good books to succeed.

Andrew Weber, the global chief operating officer for Macmillan Publishers—whose St. Martin’s Press is publishing *The Bestseller Code*—thinks algorithms should be viewed as an additional piece of information, rather than as an excuse to fire the editors. “Whether it’s in acquisition, whether it’s in pricing, whether it’s in marketing, whether it’s in distribution, there just seem to be many, many, many opportunities to improve the quality of our decision-making—and therefore hopefully our results—by bringing data into the equation,” says Weber. “I would say we are still in the early days of that journey, but that’s the direction we’re headed.”

Archer and Jockers watched eagerly to see which novel would be their algorithm’s favorite. It turned out to be The Circle, a 2013 technothriller by Dave Eggers about working for a massively powerful Internet company. The Circle spent multiple weeks on both The New York Times hardcover fiction and paperback trade fiction bestseller lists. A movie version starring Emma Watson and Tom Hanks is expected in theaters this year.

Link to the rest at Wired

It appears that PG missed this when it first appeared in 2016.

He suspects the almost-universal phobia towards computers, algorithms, quantitative analysis, sophisticated metrics, etc., among the indwellers of traditional publishing is related to the widespread incidence of innumeracy among English majors.

Worship of The Golden Gut is the state religion of this group. For them, no collection of numbers and formulae can ever replace The Hunch. That’s one reason why so many books fail to earn out their advances, and why so many mega-sellers are first rejected by every major publisher before stumbling into the market and finding success.

Indie authors include a much wider slice of humanity than either publishers or traditionally-published authors. That diversity of talent and background, combined with Amazon’s relentless pursuit of customers and, thus, numbers, analytics, categories, sub-categories and sub-sub-categories, fosters the creation of niches within niches all the way down to the micro-reader level.

PG just checked a random book on the Zon and discovered that it encouraged drill-down and discovery as follows:

Books
* Mystery, Thriller & Suspense
* Thrillers & Suspense
* Suspense

With broad categories mentioned:

Book Fiction Moods

Book Mystery Characters

Some Authors:

Author

(PG is not certain how much of this collection of information is presented as a result of PG’s and Mrs. PG’s past buying habits.)

Finally, if you prefer, you could check out 383 different categories, series, spinoffs, heroes/heroines, etc., etc., etc. (including 盗墓笔记, El cementerio de los libros, Svartåsen and Die Krimi-Serie in den Zwanzigern), as follows:

1. 900-Zombie-Thriller (1)
2. A Cotten Stone Mystery (1)
3. A Department Q Novel (1)
4. A Jonathan Grave Thriller (2)
5. A Topsail Island novel (1)
6. Aaron Falk (2)
7. Against Series / Raines of Wind Canyon (1)
8. Agatha Raisin Sammelband (1)
9. Agent Juliet (1)
10. Agent Pendergast (4)
11. Alex Cross (4)
12. Alex Delaware (13)
13. Alex Devlin (1)
14. Alex Hawke (6)
15. Alex McKnight (1)
16. Alexandra Cooper (5)
17. Alfonzo (1)
18. Ali Reynolds Series (15)
19. All Souls Trilogy (1)
20. Allison McNeil Series (1)
21. Alo Nudger (1)
22. Amos Decker (5)
23. An Elvis Cole Novel (3)
24. An FBI Thriller (1)
25. An Isaiah Coleridge Novel (1)
26. An Under Suspicion Novel (1)
27. Anderswelt John Sinclair Spin-off (18)
28. Andreas Gruber Erzählbände (1)
29. Anna Pigeon Mysteries (1)
30. Annie Carter Series (3)
31. Ash Henderson (2)
32. Asher Benson (1)
33. Auftrag: Mord! (3)
34. Beartooth, Montana (1)
35. Ben Abbott Mysteries (1)
36. Ben Hope (20)
37. Blood on Snow (2)
38. Bob Lee Swagger Novels (4)
39. Breaking Free (1)
40. Camel Club (2)
41. Cape Charade (3)
42. Carl Mørck (1)
43. Carriage House (5)
44. Carson Ryder (9)
45. Casey Woods (1)
46. Cat Who… (1)
47. Cate Austin (1)
48. Charlie Chan Mystery (1)
49. Chefinspektor Tony Braun (2)
50. Cherokee Pointe (1)
51. Chet and Bernie Mystery (16)
52. Chronicles of The One (7)
53. Cold Justice (1)
54. Commandant Martin Servaz (1)
55. Commissario Brunetti (8)
56. Conrad Yeats Adventure (1)
57. Cork O’Connor (17)
58. Cork O’Connor Mystery Series (12)
59. Cotton Malone (2)
60. Covert-One (1)
61. Crissa Stone (1)
62. Cutler (2)
63. D.I. Callanach (1)
64. Dagny Gray (1)
65. Dalziel & Pascoe (1)
66. Dalziel and Pascoe (14)
67. 盗墓笔记 (1)
68. Dark Iceland (1)
69. Dave Gurney (1)
70. Dave Robicheaux (8)
71. David Stein (1)
72. David Wolf (1)
73. DCI Matilda Darke (1)
74. Dead series (1)
75. Detective Erika Foster (2)
76. Detective Josie Quinn (2)
77. Detective Mark Heckenburg (3)
78. Detective Max Rupert (2)
79. Detektei Lessing Kriminalserie (3)
80. DI Fawley (2)
81. Die ARES-Reihe (2)
82. Die Cormoran-Strike-Reihe (1)
83. Die Dead-Silencer-Saga (1)
84. Die Irene-Huss-Krimis (1)
85. Die Krimi-Serie in den Zwanzigern (23)
86. Dirk Pitt (1)
87. Dismas Hardy (15)
88. Divine (1)
89. Dr. Lazlo Kreizler (1)
90. Dr. Marissa Blumenthal (1)
91. Dr. Samantha Owens series (1)
92. Drake Ramsey (2)
93. DS Heckenburg (6)
94. DS Imogen Grey (2)
95. Dunkle Begierde (1)
96. Dynam (1)
97. Ed Eagle Novel (2)
98. Ein Fall für Engel und Sander (2)
99. Ein FBI Thriller mit Dillon Savich und Lacey Sherlock (3)
100. Ein Jack-Reacher-Roman (1)
101. Ein Mike-Köstner-Thriller (1)
102. El cementerio de los libros olvidados (1)
103. EL SECRETO DE LOS ARTISTAS (1)
104. Emma Fern (4)
105. Enrico Mancini (2)
106. Essex Witch Museum Mystery (2)
107. Eve Diamond Mystery (1)
108. Eve Duncan (2)
109. Event Group Thriller (1)
110. Fatal Insomnia Medical Thrillers (6)
111. FBI Profiler (1)
112. Final Theory (1)
113. Fiona Griffiths Crime Thriller Series (1)
114. Forensic Instincts (1)
115. Fort Aldamo (57)
116. Frank Wallerts Fälle (7)
117. Frankenstein (1)
118. Franz Eberhofer (3)
119. G. F. Unger Sonder-Edition (102)
120. G.F. Unger Classic-Edition (11)
121. Gabriel Allon (1)
122. Geisterjäger John Sinclair (6)
123. Gideon Crew (2)
124. Giordano Bruno (1)
125. Go-get-’em Women (1)
126. Good Lawyer (3)
127. Grant County (3)
128. Graveyard Falls (1)
129. Griffin Powell (1)
130. Guardian (1)
131. Hackberry Holland (3)
132. Harrison Investigation (2)
133. Harry Bosch (4)
134. Harry Palmer (1)
135. Hart and Drake (8)
136. Hector Cross Series (1)
137. Hercule Poirot (20)
138. High Country Heroes (2)
139. Hold On! (1)
140. Holly Barker (1)
141. Honeymoon Series James Patterson (1)
142. I Heart (1)
143. If I Run (4)
144. In Death (2)
145. Inspector Barbarotti (2)
146. Inspector Lynley (3)
147. Inspector Montalbano (2)
148. Inspector Montalbano Mysteries (1)
149. IQ (1)
150. Iron Lace (1)
151. Isas Requiem (1)
152. Jack Noble (1)
153. Jack Paris (1)
154. Jack Reacher (2)
155. Jack Sigler Thrillers (Chess Team) (1)
156. Jack Stapleton & Laurie Montgomery series (1)
157. Jacqueline Kirby (1)
158. Jake Brigance (7)
159. Jake Ransom (1)
160. James Blake (2)
161. Jane Harper Horror Novels (2)
162. Jane Hawk (2)
163. Jericho Quinn Thriller (8)
164. Jerry Cotton Sammelband (5)
165. Jerry Cotton Sammelbände (14)
166. Jerry Cotton Sonder-Edition (84)
167. Jerry Cotton Sonder-Edition Sammelbände (3)
168. Jet (4)
169. Joanna Stafford (1)
170. Joe Dillard Series (1)
171. Joe Pickett Series (2)
172. Joe Pike series (1)
173. Joe Sixsmith (3)
174. Johannes-Hornoff-T… (1)
175. John Reeves (2)
176. John Sinclair Collection (18)
177. John Sinclair Gespensterkrimi (1)
178. John Sinclair Gespensterkrimi Collection (9)
179. John Sinclair Großband (13)
180. John Sinclair Sammelband (8)
181. John Sinclair Sonder-Edition (67)
182. John Sinclair Sonder-Edition Sammelband (7)
183. Joona Linna (2)
184. Judith Kepler (1)
185. Jungle Beat (7)
186. Karin Slaughter Thriller-Bundle (2)
187. Kate Brannigan (4)
188. Kate Ivory (14)
189. Kate Maddox (2)
190. Kathryn-Dance-Thri… (1)
191. Kay Scarpetta (11)
192. Kick Lannigan (2)
193. Kimmo-Joentaa-Reihe (1)
194. King and Maxwell (9)
195. Kirstmann und Freytag (1)
196. Kitt Lundgren (1)
197. Kolt Raynor (1)
198. Lassiter 2101-2200 (3)
199. Lassiter 2201-2300 (10)
200. Last Option Search Team (3)
201. Last Stand (1)
202. Leo Demidow (1)
203. Leverage (2)
204. Liam Devlin series (1)
205. Lizzie Martin (2)
206. Logan McRae (5)
207. Logan McRae Collection (2)
208. Louis Kincaid (1)
209. Louise Rick series (2)
210. Lucy Clayburn (3)
211. Lucy Guardino FBI Thrillers (3)
212. luebbe digital ebook (5)
213. Luke Carlton (1)
214. Luna Maiwald Rügenkrimi (1)
215. Maddrax (4)
216. Marc Dane (1)
217. Marcus (1)
218. Maura Ryan (2)
219. Maximum Ride: The Manga (2)
220. Maximum Security (1)
221. Medical Thrillers (Gerritsen) (1)
222. Mercy Kilpatrick (1)
223. Mia Quinn (1)
224. Michael Bennett (3)
225. Michael Herne (1)
226. Midwife (2)
227. Miss Marple Mysteries (1)
228. Mississippi (2)
229. Mitchell & Associates (4)
230. Monster Hunter International (1)
231. Nameless Detective (3)
232. Natalie King, Forensic Psychiatrist (1)
233. Nick Hall (2)
234. Night Soldiers (1)
235. Nils Trojan (1)
236. Nomad (1)
237. NYPD Red (2)
238. Odd Thomas (2)
239. Operation: Midnight (1)
240. OPSIG Team Black Series (1)
241. P.I.D. (2)
242. Penn Cage Novels (2)
243. Peter Decker & Rina Lazarus (4)
244. Peter Decker/Rina Lazarus (4)
245. Petra Connor (1)
246. Pilgrim (3)
247. Predator & Prey (1)
248. Prey (5)
249. Privatdetektiv Marten Hendriksen (1)
250. Private (2)
251. Promise Falls Trilogy (1)
252. Raines of Wind Canyon (2)
253. Random House Large Print (3)
254. Relatively Dead Mysteries (1)
255. Richard “Dick” Moonlight (1)
256. Rizzoli-&-Isles-Serie (2)
257. Robert Langdon (1)
258. Robicheaux (7)
259. Rocky Mountain Bounty Hunters (1)
260. Rocky Mountain K9 Unit (4)
261. Ryan Archer (1)
262. Sakura Warrior – Reihe (1)
263. Sally Harrington (1)
264. Sam Berger Series (1)
265. Sam Capra Mysteries (2)
266. Samson (1)
267. San Francisco (1)
268. Sandhamn Murders (2)
269. Sanela Beara (1)
270. Sarah Pauli (2)
271. Scarlet Falls (1)
272. Scope (2)
273. Sean Dillon (5)
274. Search and Rescue (4)
275. Second Opportunities (1)
276. Selena Alvarez/Regan Pescoli (1)
277. Shane Schofield (1)
278. Sharon McCone (3)
279. Sharpe & Donovan (2)
280. Shaw and Katie James (7)
281. Sigma Force (7)
282. Simon Vaughn (2)
283. Sisterhood (3)
284. Six Stories (2)
285. Skink (1)
286. Smoky Barrett (3)
287. Smoky Barrett Sammelband (1)
288. Soko Hamburg – Ein Fall für Heike Stein (18)
289. Sonderermittler der Krone (5)
290. Spilling CID (1)
291. Split Second (1)
292. Stalking Jack the Ripper (1)
293. Stephanie Plum (4)
294. Stephanie Plum Between the Numbers/Holiday Novels (1)
295. Stillhouse Lake (6)
296. Stone Barrington (7)
297. Stranger Things Novels (2)
298. Superintendent Battle (4)
299. Svartåsen (1)
300. Talisman (5)
301. Tall, Dark & Dangerous (1)
302. Temperance Brennan (6)
303. Teodor Szacki (2)
304. Texas Rangers (2)
305. Texas Trilogy (2)
306. The Annie Graham series (1)
307. The Avalon Chronicles (3)
308. The Awakening Series (1)
309. The Bening Files (2)
310. The Bill Hodges Trilogy (3)
311. The Blaine Trilogy (1)
312. The Butlers (6)
313. The Cal O’Connor Series (1)
314. The Cards in the Deck (2)
315. The Cat Who… (23)
316. The Cemetery of Forgotten Series (1)
317. The China Thrillers (3)
318. The Clifton Chronicles (10)
319. The Color of Distance (1)
320. The Commandant Camille Verhoeven Trilogy (2)
321. The Cooper & Fry Series (1)
322. The Cousins War (4)
323. The Dark Iceland Series (1)
324. The Dark Tower (6)
325. The Death Trilogy (3)
326. The End Series (1)
327. The Flovent and Thorson Thrillers (1)
328. The Immune (4)
329. The Kate Lange Thriller Series (2)
330. The Keepers (3)
331. The Men Of The Sisterhood (1)
332. The Mitch Rapp Prequel Series (8)
333. The Mitch Rapp Series (31)
334. The Oxygen Thief Diaries (2)
335. The Paul Chavasse Novels (2)
336. The Pieter Van In Mysteries (1)
337. The Psychic Detectives Series (1)
338. The Restoration Series (5)
339. The Retreat (2)
340. The Roth Trilogy (3)
341. The Sara Winthrop Thriller Series (1)
342. The Scot Harvath Series (45)
343. The Sean Coleman Thriller series (1)
344. The Talisman (5)
345. The Tallow Series (1)
346. The Warm Bodies Series (1)
347. Thomas Eickhoff ermittelt (1)
348. Thomas Kell (3)
349. Thomas Knight (1)
350. Tina Boyd (5)
351. Todeslächeln (2)
352. Tom Thorne (2)
353. Tom Thorne series (1)
354. Tommy and Tuppence (6)
355. Tracers Series (1)
356. Troubleshooters (1)
357. Turbulent Desire Series (2)
358. Twin Ports (1)
359. Ty Hauck (3)
360. Under Suspicion (1)
361. Undercover Cops (1)
362. Unit 51 (1)
363. V.I. Warshawski Novels (2)
364. Vampire Chronicles (1)
365. Vampire Federation (1)
366. Vintage Contemporaries (1)
367. Virgil Flowers (1)
368. Wayward Pines (6)
369. Wegner & Hauser – Hamburg: Mord (2)
370. Wegners erste Fälle (8)
371. Wegners schwerste Fälle (9)
372. Will Lee Novels (1)
373. Will Robie (2)
374. Will Trent/Atlanta Series (1)
375. Will-Trent-Serie (1)
376. William Sandberg (1)
377. Wired (2)
378. Wired & Dangerous (1)
379. Wishbone (1)
380. Women’s Murder Club (9)
381. World War I (1)
382. World’s Scariest Places Occult & Supernatural Crime Series (7)
383. Wyman Ford (7)

16 thoughts on “Algorithms Could Save Book Publishing—But Ruin Novels”

  1. If they used a training set of 5,000 books, how did the algorithm perform when faced with a test of another 5,000 books it had never seen?

    I’d also ask why they chose to publish rather than shop their system to publishers.

  2. It’s a lot worse than PG says. Publishing executives aren’t precisely innumerate — they can plug numbers into spreadsheets to make decisions for them all day long. The problem is that the entertainment industry is anti-science, perhaps the epitome of C.P. Snow’s “Two Cultures” problem: Because the subject of the entertainment industry is “art,” gathering and evaluating actual data gets no attention whatsoever.

    Consider the most-obvious flaw in the Archer and Jockers project (other, that is, than that it is extolled as cutting-edge in Wired, which is almost always a tinfoil-hat warning): It was conducted on each work as if each work stood alone. Not just “part of a series,” but “in time” and “in immediate social context.” Consider the rise in badly-limned followers-of-Mohammed “bad guys” starting in the late 1990s and accelerating rapidly afterward, compared to followers-of-Stalin “bad guys” in the same period. And so on.

    And as flawed as that project is, it’s vastly more searching than anything done in the publishing industry. For example, there’s a meme that trade books with predominantly green covers don’t sell. It was out of date when it was thrown at me in the 1990s: It was based on a combination of lighting characteristics and ink chemistries from the early 1960s that existed practically nowhere by 1990. Similarly for “embossed lettering sells books” (how does that work on Amazon, BTW — let alone for e-books, when the metallic shades most common in embossed lettering get distorted by 56 different types of displays?). But I’ve seen both of these memes presented as absolute, irrefutable fact in the last two years.

    The real problem is that management doesn’t want to know anything that might require it to make expensive changes to its existing system.

    • Because the subject of the entertainment industry is “art,” gathering and evaluating actual data gets no attention whatsoever.

      TV ratings do seem to get a passing glance.

      • … which assumes:

        That TV ratings are meaningful as to actual perception of the programming (example: how much of Super Bowl ratings are for the commercials?)

        That the means of gathering TV ratings bear some relationship to reality

        That the time over which TV ratings are gathered for different programs allows comparison (is “same day plus three” comparable for football games, weekday soap operas, Judge Judy, Survivor, and the final episode of M*A*S*H?)

        That “ratings” directly correlate to “enthusiasm”

        Remember, the ratings are not used to establish anything except “rate chargeable to advertisers”; they don’t actually reflect revenue (there is more than one program currently airing with higher ratings that can’t ordinarily sell all of its ad time at the full rate, for example)

        In short, “tracking stuff to which a number can be attached” is not necessarily gathering valid data.

        • … which assumes:

          It assumes nothing.

          It’s an observation that ratings are actual data that gets significant attention from the entertainment industry. And, as you say, that data reflects what advertisers will pay. Price is one of the important decision variables in a for-profit operation.

          And is entertainment art? I don’t know. Who cares? Not consumers, not advertisers, and not the entertainment industry. Maybe artists care.

          • The people who interpret the data do, in fact, assume every one of the things that CE mentioned and you so airily dismissed. Raw data is of very little use unless you understand what you are measuring, why it is (or isn’t) important, and what its limitations are.

          • Uh, ratings *aren’t* actual data.
            They are algorithmically processed incomplete data.
            Instead of actual viewers, they are a weighted aggregate of live viewers, dvr viewers, and first-week streaming viewers. With each category getting a different multiplier.
            That is why the highest ratings go to sports and why advertisers pay more for them: they aren’t dvr-able and viewers can’t pause and return a day later.
            Nielsen doesn’t even collect full data. They estimate viewership by varying means, some of which are self-reported.
            It is simply a standardized metric that ranks programs’ estimated mass appeal. It doesn’t even come close to measuring profitability, since many low-rated shows are gold mines for the producers while higher-rated productions are money losers, which means the metric isn’t even good at what it is really supposed to measure. Think of saddle shows: half-hour shows slotted between two popular shows. They have good ratings while carried by the saddle but tank when set as the lead show in a block because the show lacks intrinsic appeal. More often than not it is simply preferred to watching the second half hour of a one-hour show.
            So no, ratings aren’t real data and the TV world knows it.
            They just have nothing better.

            • I agree ratings data don’t show profitability. Who but the conjunctively disjunctive think they come close? They don’t consider costs.
