A recent article, "Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election," reminds me of a great definition of statistics: "Principled thinking and methodology development for dealing with uncertainty." My intent is certainly not to be partisan but to highlight data that could just as easily have been rendered in a behemoth healthcare data set. IPUMS describes the process well: "Think of census data needing to be cleaned, merged with new data, editing routines developed, and millions of strings coded into useful classifications. The data are far too large for manual inspection, requiring efficient data analysis and scalable approaches including machine learning."
The article is technical but can be parsed into tangible little morsels. Here are a few that interest me. Large-sample asymptotics is essentially the limiting behavior of an estimate as the sample size grows; you have probably heard of the Central Limit Theorem, and it has a role in the limits of Big Data. So that I don't go down the rabbit hole of technical jargon, here is a great analogy from the article (you should read it). Think of statistics as dealing with uncertainty; the author uses food metaphors, and they help. Fast food will always exist because of the demand: how many of us have repeatedly had those quick bites that our doctors have repeatedly told us to stay away from?
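To make the large-sample idea concrete, here is a minimal sketch (my own illustration, not code from the article): the spread of a sample mean shrinks like 1/sqrt(n), which is the behavior the Central Limit Theorem describes. The exponential distribution and the sample sizes are arbitrary choices for the demonstration.

```python
import random
import statistics

random.seed(0)

def sd_of_sample_mean(n, trials=2000):
    """Empirical standard deviation of the mean of n exponential draws."""
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

# As n grows tenfold, the spread of the sample mean drops by about sqrt(10):
for n in (10, 100, 1000):
    print(n, round(sd_of_sample_mean(n), 3))
```

The printed spreads track the theoretical value 1/sqrt(n), which is why "more data" usually means "less random noise" for an honest random sample.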
I will bullet the most important findings in the article, taken mostly verbatim:
Fig. 4 (please go to the article for details): comparison of actual vote shares with 2016 Cooperative Congressional Election Study (CCES) estimates across 50 states and DC. "Color indicates a state's partisan leanings in 2016 election: solidly Democratic (blue), solidly Republican (red), or swing state (green). The left plot uses sample averages of the raw data (n = 64,600) as estimates; the middle plot uses estimates weighted to likely voters according to turnout intent (estimated turnout n̂ = 48,106); and the right plot uses sample averages among the subsample of validated voters (subsample size, n = 35,829)."
Observe that the only voting regions where the confidence intervals (barely) cover the actual results are those in the Trump data. This should provide a clear warning of the Big Data Paradox: it is the larger turnouts that lead to more estimation error, because of systematic (hidden) bias, contrary to our common wisdom of worrying about increased random variation with smaller turnouts.
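The paradox can be demonstrated with a toy simulation (my own sketch, not the article's code; the population size, true vote share, and response rates are all assumed numbers): a huge sample with a small hidden response bias ends up further from the truth than a modest truly random sample.

```python
import random

random.seed(42)

# Assumed illustration values, not figures from the article:
N = 1_000_000        # population size
p_true = 0.48        # true support for candidate A

population = [1] * int(N * p_true) + [0] * (N - int(N * p_true))
random.shuffle(population)

# Small simple random sample: 1,000 people, no bias.
small = random.sample(population, 1_000)
est_small = sum(small) / len(small)

# Huge "big data" sample: roughly half the population responds, but
# supporters of A are slightly less likely to respond (hidden bias).
big = [v for v in population
       if random.random() < (0.45 if v == 1 else 0.55)]
est_big = sum(big) / len(big)

print(f"true share: {p_true:.3f}")
print(f"small random sample ({len(small):,}): error = {abs(est_small - p_true):.4f}")
print(f"big biased sample   ({len(big):,}): error = {abs(est_big - p_true):.4f}")
```

Despite being about 500 times larger, the biased sample's error does not shrink with size; the bias dominates, which is exactly the warning behind the Big Data Paradox.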
This article is one you may need to read and reread several times. I find it important as we enter the age of personalized medicine. You can see here, even if the mathematics underlying the problem are complex, that individualized predictions are approximations.
Because each of us is unique, any attempt to "personalize" must be approximative in nature. But this is a very different kind of notion of approximation, compared to the traditional large-sample asymptotics, where the setup is to use a sample of individuals to learn about the population they belong to.
Why do so many of us view statistics and data literacy as something we just aren't built to comprehend? Sal Khan from Khan Academy hits the nail on the head. If you are like me and annoyed by everything of value being tucked behind a paywall, his motto is: "You can learn anything. For Free. For Everyone. Forever."
Finally, I now know why I am able to grasp advanced statistical principles (at least the big ideas) readily in my modern-day workflow. He summarizes it in his brief and worthwhile talk below; it is only about 10 minutes, but BOOM, it is worth it. In a nutshell, when we are self-directed and learning or relearning a skill or subject, we tend to focus on what we don't know before moving on. In the traditional academic model, we group students together, usually by age (and, around middle school, by age and perceived ability), and we shepherd them all along at the same pace. Here is what typically happens: say we're in a middle school pre-algebra class and the current unit is on exponents. The teacher gives a lecture on exponents, then we go home and do some homework.
I never thought about the challenge of recalling academic courses and specializations in advanced calculus, statistics, or programming languages. It was always about the test, and let's face it, if you got an 85% on a test you were chuffed. Or at least I was. But what about the 15% of the material you didn't know? If your car brakes were repaired only 85% of the way, would you consider it safe?
I thought I would share a few simple graphics I created to demonstrate what clarity and granularity can yield: what the small differences in a bar chart might mean if we are looking at the country as a whole.
The last graphic is of the electoral college votes, the only vote that matters in the US, but you can see how additional clarity can once again be more insightful than reliance on a simple statistic about winning an election. Thanks to Alberto Cairo for the image below.
Reach out for any help with data literacy or creating your own data stories, in either your professional or personal life; brainstorming is always free! twitter.com/datamongerbonny
In a world of "evidence-based" medicine, I am a bigger fan of practice-based evidence.
Remember the quote by Upton Sinclair: "It is difficult to get a man to understand something, when his salary depends upon his not understanding it!"