A recent article, "Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election," reminds me of a great definition of statistics: "Principled thinking and methodology development for dealing with uncertainty." My intent is certainly not to be partisan but to highlight data that could just as easily have been rendered in a behemoth healthcare data set. IPUMS describes the process well: "Think of census data needing to be cleaned, merged with new data, editing routines developed, and millions of strings coded into useful classifications. The data are far too large for manual inspection, requiring efficient data analysis and scalable approaches including machine learning."
The article is technical but can be parsed into tangible little morsels. Here are a few that interest me. Large-sample asymptotics describe how an estimate behaves as the sample grows--a limiting approximation. You have probably heard of the Central Limit Theorem, and it plays a role in the limits of Big Data. So that I don't drag you down the rabbit hole of technical jargon, here is a great analogy from the article (you should read it). Think of statistics as dealing with uncertainty--the author uses food metaphors--they help.
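To make the Central Limit Theorem concrete, here is a minimal sketch (my own illustration, not from the article): averages of draws from a skewed, decidedly non-normal distribution cluster ever more tightly around the true mean as the sample size grows.

```python
# Sketch of the Central Limit Theorem in action: the spread of sample
# means shrinks roughly like 1/sqrt(n) as the sample size n grows.
import random
import statistics

random.seed(42)

def mean_of_sample(n):
    # Exponential draws: a skewed distribution with true mean 1.0.
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

for n in [10, 100, 10_000]:
    means = [mean_of_sample(n) for _ in range(500)]
    spread = statistics.stdev(means)
    print(f"n={n:6d}  spread of 500 sample means = {spread:.3f}")
```

Quadrupling the sample size roughly halves the spread of the sample means--this is the "large-sample asymptotic" intuition behind the article's analysis.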
Fast food will always exist because of the demand--how many of us have repeatedly had those quick bites our doctors have told us to stay away from?
I will bullet the most important findings in the article taken mostly verbatim:
Fig 4. (please go to the article for details) compares actual vote shares with 2016 Cooperative Congressional Election Study (CCES) estimates across 50 states and DC. "Color indicates a state’s partisan leanings in 2016 election: solidly Democratic (blue), solidly Republican (red), or swing state (green). The left plot uses sample averages of the raw data (n = 64,600) as estimates; the middle plot uses estimates weighted to likely voters according to turnout intent (estimated turnout n̂ = 48,106); and the right plot uses sample averages among the subsample of validated voters (subsample size, n = 35,829)."
Observe which voting regions' confidence intervals (barely) cover the actual results (the Trump data). It should provide a clear warning of the Big Data Paradox: it is the larger turnouts that lead to larger estimation errors, because of systematic (hidden) bias--contrary to our common wisdom of worrying about increased random variation in smaller turnouts.
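The paradox is easy to demonstrate in a toy simulation. Below is a minimal sketch with made-up numbers (the true vote share and response rates are my own assumptions, not the article's): a huge sample with a small hidden nonresponse bias misses the truth by more than a small but genuinely random sample.

```python
# Sketch of the Big Data Paradox: big-and-biased loses to small-and-random.
import random
import statistics

random.seed(7)

POP_SIZE = 1_000_000
TRUE_SHARE = 0.48  # hypothetical true vote share

# Each voter either supports the candidate (1) or does not (0).
population = ([1] * int(POP_SIZE * TRUE_SHARE)
              + [0] * (POP_SIZE - int(POP_SIZE * TRUE_SHARE)))

# Small but honest: a simple random sample of 1,000 voters.
srs = random.sample(population, 1_000)
srs_estimate = statistics.mean(srs)

# Huge but biased: supporters are slightly *less* likely to respond
# (a hidden nonresponse bias), yet roughly 400,000 people answer.
biased = [v for v in population
          if random.random() < (0.35 if v == 1 else 0.45)]
biased_estimate = statistics.mean(biased)

print(f"true share              : {TRUE_SHARE:.3f}")
print(f"random sample (n=1,000) : {srs_estimate:.3f}  "
      f"error {abs(srs_estimate - TRUE_SHARE):.3f}")
print(f"biased sample (n={len(biased):,}): {biased_estimate:.3f}  "
      f"error {abs(biased_estimate - TRUE_SHARE):.3f}")
```

The 400-fold larger sample does not rescue the biased estimate; a tiny gap in who responds swamps the gain in sheer size. This is the article's point about systematic bias dominating random variation.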
This article is one you may need to read and re-read several times. I find it important as we enter the age of personalized medicine. You can see here--even if the mathematics underlying the problem are complex--individualized predictions are approximations.
Because each of us is unique, any attempt to “personalize” must be approximative in nature. But this is a very different kind of notion of approximation, compared to the traditional large-sample asymptotics, where the setup is to use a sample of individuals to learn about the population they belong to.
Why do so many of us view statistics and data literacy as something we just aren't built to comprehend? Sal Khan from Khan Academy hits the nail on the head. If you are like me and annoyed by everything of value being tucked behind a paywall, his motto is, "You can learn anything. For Free. For Everyone. Forever."
Finally, I now know why I am able to grasp advanced statistical principles (at least the big ideas) readily in my modern-day workflow. He summarizes it in his brief and worthwhile talk below--only about 10 minutes, but BOOM, it is worth it. In a nutshell, when we are self-directed and learning or re-learning a skill or subject, we tend to focus on what we don't know before moving on.
In a traditional academic model, we group students together, usually by age, and around middle school, by age and perceived ability, and we shepherd them all together at the same pace. And what typically happens, let's say we're in a middle school pre-algebra class, and the current unit is on exponents, the teacher will give a lecture on exponents, then we'll go home, do some homework.
I never thought about the challenge of recalling academic courses and specializations in advanced calculus, statistics, or programming languages. It was always about the test, and let's face it, if you got an 85% on a test you were chuffed. Or at least I was. But what about the 15% of the material you didn't know? What if your car's brakes were repaired only 85% of the way--would you consider them safe?
I thought I would share a few simple graphics I created to demonstrate what clarity and granularity can yield--what the small differences in a bar chart might mean once we look at the country as a whole.
The last graphic shows the electoral college votes--the only vote that matters in the US--and you can see how additional clarity can once again be more insightful than reliance on a simple statistic about winning an election... thanks to Alberto Cairo for the image below.
Reach out for any help with data literacy or creating your own data stories--either in your professional or personal life--brainstorming is always free! twitter.com/datamongerbonny
In a world of "evidence-based" medicine I am a bigger fan of practice-based evidence.
Remember the quote by Upton Sinclair...
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”