Enough Data... But Don't Forget Your Context
Just about every newspaper and online news source in the developed world has reported on the outing of J.K. Rowling of Harry Potter fame as the author of the new book "The Cuckoo's Calling." Apparently, quite a bit of technology went into the analysis of the book, and the WSJ has a pretty good piece on the technology behind it. As a nice side benefit, some of the general use of technology for literary analysis and forensics has come to light. The NYTimes has an interesting piece called "Crunching Literary Numbers" that describes how similar analysis shows that what we think were common themes in literature and society may not actually have been so.
Here are some important insights from the two parts of the story that are very relevant to business:
- Ask the right question: The right question is the most important part. True, the Galbraith/Rowling analysis allowed the investigators to show that it was highly likely that Rowling authored the book. But they first had to have an inkling that it was Rowling, and they got that from good old-fashioned human sleuthing. Quite simply, their suspicions were aroused because the representative at the publisher who handled "Cuckoo" was the same person who handled Harry Potter, and the chances of a super-successful representative also handling a complete unknown with middling critical approval are quite low. No computer figured this out; it was human intuition. Thus, the possibility that Galbraith was a pseudonym had to be raised, and then the questions "Who is Galbraith, really?" and "Who are the most likely candidates?" had to be answered before feeding the data into the analytics.
- Get enough data: As Egnal showed in the NYTimes piece, literature reviewers, historians, and critics, even the greatest (I went to Columbia, but Lionel Trilling died over 15 years before I got there), are human. They have limited time and built-in biases, even unintentional ones. By definition, they can only sample so many works before drawing their conclusions. If you sample 50 books, and all show a tendency towards pessimism in 1937, you might well conclude that the Great Depression was a period of bleak pessimism. But if 10,000 books were written, and you only (intentionally or not) sampled those 50, which were half of all of the pessimistic ones published, leaving 9,900 optimistic or at least neutral, you haven't got enough data. In statistics, we call this a sufficiently large, representative sample; in the real world, we say, "make sure you've checked with enough people." The first sketch after this list makes the trap concrete.
- Understand your context: You need to understand the context and validity of the data. To take the previous example: if 10,000 books were published and only 100 were pessimistic, at first blush you would conclude that the times were optimistic (or at least neutral). But if, of those 10,000, 9,875 sold no more than 100 copies, while the remaining 125 - including 100 bleak and pessimistic ones and 25 neutral ones - sold 100,000 or more copies each, then those 125 count for far more. That in and of itself may be an indicator that some people tried to change the mood, but that the general population was bleak, pessimistic, and uninterested in an optimistic message. The second sketch below weights those same fictional books by copies sold.
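To make the "enough data" point concrete, here is a minimal sketch in Python using the purely fictional 1937 numbers from the bullets above (10,000 books, 100 of them pessimistic, a critic whose 50-book reading list happens to be half of the pessimistic titles). Nothing here is real literary data; it only shows how a biased reading list and a random one lead to very different conclusions.

```python
import random

random.seed(1937)

# Fictional 1937 corpus from the bullets above: 10,000 books, only 100 pessimistic.
corpus = ["pessimistic"] * 100 + ["optimistic_or_neutral"] * 9_900
random.shuffle(corpus)

def share_pessimistic(sample):
    """Fraction of a reading list that is pessimistic."""
    return sum(1 for book in sample if book == "pessimistic") / len(sample)

# Biased reading list: 50 books that happen to be half of all the pessimistic
# titles published -- every single book in the sample is bleak.
biased_sample = [book for book in corpus if book == "pessimistic"][:50]

# Unbiased reading list: 50 books drawn at random from the whole corpus.
random_sample = random.sample(corpus, 50)

print(f"Biased sample of 50:  {share_pessimistic(biased_sample):.0%} pessimistic")
print(f"Random sample of 50:  {share_pessimistic(random_sample):.0%} pessimistic")
print(f"Whole corpus:         {share_pessimistic(corpus):.0%} pessimistic")
```

The biased list says 1937 was uniformly bleak; the random list and the full corpus come out overwhelmingly non-pessimistic. Same year, same books, different sample.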
As an aside, those statistics about book sales and mood in 1937 are a purely fictional example.
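Staying with those same fictional numbers, this second sketch weights each book by copies sold instead of counting titles. The sales figures (9,875 titles at 100 copies or fewer, 125 titles at 100,000 copies each) are the invented ones from the bullet above, so the output only illustrates the mechanics of weighting, not anything about 1937.

```python
# Same fictional 1937 corpus, now with the invented sales figures attached:
# 9,875 titles sold no more than 100 copies; the remaining 125 titles
# (100 pessimistic, 25 neutral) each sold 100,000 copies.
books = (
    [("pessimistic", 100_000)] * 100
    + [("neutral", 100_000)] * 25
    + [("optimistic_or_neutral", 100)] * 9_875
)

def pessimistic_share(weight_by_sales: bool) -> float:
    """Fraction of the corpus that is pessimistic, by titles or by copies sold."""
    def weight(copies: int) -> int:
        return copies if weight_by_sales else 1

    total = sum(weight(copies) for _, copies in books)
    pessimistic = sum(weight(copies) for mood, copies in books if mood == "pessimistic")
    return pessimistic / total

print(f"By title count: {pessimistic_share(weight_by_sales=False):.1%} pessimistic")
print(f"By copies sold: {pessimistic_share(weight_by_sales=True):.1%} pessimistic")
```

Counting titles, the era looks almost entirely non-pessimistic; weighting by what people actually bought, the picture flips. That is the whole point of checking context before trusting raw counts.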
If you really want to understand the real world, especially if you are going to make crucial business decisions: ask the right questions; get enough data; and understand the context and value of the data you get.