A recent article, *Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election*, reminds me of a great definition of statistics: "Principled thinking and methodology development for dealing with uncertainty." My intent is certainly not to be partisan but to highlight data that could just as easily have been rendered in a behemoth healthcare data set. IPUMS describes the process well: "Think of census data needing to be cleaned, merged with new data, editing routines developed, and millions of strings coded into useful classifications. The data are far too large for manual inspection, requiring efficient data analysis and scalable approaches including machine learning."

The article is technical but can be parsed into tangible little morsels. Here are a few that interest me. Large-sample asymptotics is basically the limiting behavior of an estimate as the sample grows--you have probably heard of the Central Limit Theorem, and it has a role in the limits of Big Data. So that I don't drag you down the rabbit hole of technical jargon, here is a great analogy from the article (you should read it). Think of statistics as dealing with uncertainty--the author uses food metaphors, and they help.

Fast food will always exist because of the demand—how many of us have repeatedly had those quick bites that our doctors have repeatedly told us stay away from?

But this is the very reason that we need more people to work on understanding and warning about the ingredients that make fast food (methods) harmful; to study how to reduce the harm without unduly affecting their appeal; and to supply healthier and tastier meals (more principled and efficient methods) that are affordable (applicable) by the general public (users).

I will bullet the most important findings in the article taken mostly verbatim:

- difference between the sample average and the population average is the product of three terms: (1) a data quality measure, (2) a data quantity measure, and (3) a problem difficulty measure
- the most critical—yet most challenging to assess—among the three is data quality.
- When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.
- data quality must be a relative one, and more precisely it should be termed as data quality for a particular study. This is because any meaningful quantification of data quality must depend on (1) the purposes of the analysis—a dataset can be of very high quality for one purpose but useless for another; (2) the method of analysis (e.g., the choice of sample average instead of sample median); and (3) the actual data the analyst includes in the analysis
- Big Data Paradox: the bigger the data, the surer we fool ourselves.
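The first bullet is an exact algebraic identity from the article: the estimation error, sample average minus population average, equals the correlation between being recorded and the value measured (data quality), times the square root of (N − n)/n (data quantity), times the population standard deviation (problem difficulty). Here is a minimal numerical check of that identity; the response mechanism below is entirely hypothetical, chosen only so that who responds is mildly correlated with the answer:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
Y = rng.normal(50, 10, N)                    # the full population's values

# Hypothetical response mechanism: people with larger Y are slightly
# more likely to end up in the data set (a small hidden selection bias).
p = 1 / (1 + np.exp(-(Y - 50) / 40))
R = (rng.random(N) < 0.6 * p).astype(float)  # 1 = recorded, 0 = missing
n = R.sum()

sample_mean = Y[R == 1].mean()
pop_mean = Y.mean()

# The decomposition: error = quality * quantity * difficulty
rho = np.corrcoef(R, Y)[0, 1]                # data quality
quantity = np.sqrt((N - n) / n)              # data quantity
sigma = Y.std()                              # problem difficulty

assert np.isclose(sample_mean - pop_mean, rho * quantity * sigma)
```

Because this is an identity, not an approximation, the two sides agree to floating-point precision no matter what response mechanism you plug in; the point is that a tiny, invisible correlation rho is multiplied by a huge quantity term when n is large relative to what remains unrecorded.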

Fig. 4 (please go to the article for details) compares actual vote shares with 2016 Cooperative Congressional Election Study (CCES) estimates across 50 states and DC. "Color indicates a state's partisan leanings in 2016 election: solidly Democratic (blue), solidly Republican (red), or swing state (green). The left plot uses sample averages of the raw data (n = 64,600) as estimates; the middle plot uses estimates weighted to likely voters according to turnout intent (estimated turnout n̂ = 48,106); and the right plot uses sample averages among the subsample of validated voters (subsample size, n = 35,829)."

Observe the only voting regions where the confidence intervals (barely) cover the actual results (the Trump vote shares). It should serve as a clear warning of the Big Data Paradox: it is the larger turnouts that lead to the larger estimation errors because of systematic (hidden) bias, contrary to our common wisdom of worrying about increased random variation from smaller turnouts.
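The paradox can be made concrete with a toy simulation (all numbers hypothetical, not the CCES data): a huge self-selected sample carrying a small hidden bias against a much smaller but genuinely random one.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
Y = rng.normal(50, 10, N)            # true population of opinions
pop_mean = Y.mean()

# Hypothetical self-selection: larger Y means slightly more likely
# to appear in the "big data" source.
p = np.clip(0.3 + 0.004 * (Y - 50), 0, 1)
big = Y[rng.random(N) < p]           # roughly 300,000 records

# A small but genuinely random sample from the same population.
srs = rng.choice(Y, size=1000, replace=False)

big_err = abs(big.mean() - pop_mean)
srs_err = abs(srs.mean() - pop_mean)
print(f"self-selected, n={big.size}: error {big_err:.2f}")
print(f"random, n=1,000:            error {srs_err:.2f}")
```

Under these assumed numbers, the random sample of 1,000 beats the self-selected sample of roughly 300,000, because the random sample's error shrinks with its size while the big sample's error is pinned by the bias--the surer we fool ourselves.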

This article is one you may need to read and re-read several times. I find it important as we enter the age of personalized medicine. You can see here--even if the mathematics underlying the problem is complex--that individualized predictions are approximations.

Because each of us is unique, any attempt to "personalize" must be approximative in nature. But this is a very different kind of notion of approximation, compared to the traditional large-sample asymptotics, where the setup is to use a sample of individuals to learn about the population they belong to. --Xiao-Li Meng, Whipple V.N. Jones Professor of Statistics, Harvard Faculty of Arts & Sciences; Editor-in-Chief of the Harvard Data Science Review

In contrast, individualized prediction is about finding a sequence of proxy populations with increased resolutions to learn about an individual. This leads to an ultimate challenge for Statistics (and statisticians): how to build a meaningful theoretical foundation for inference and prediction without any direct data?

Why do so many of us view statistics and data literacy as something we just aren't built to comprehend? Sal Khan from Khan Academy hits the nail on the head. If you are like me and annoyed by everything of value being tucked behind a paywall, his motto is, "You can learn anything. For Free. For Everyone. Forever."

Finally, I now know why I am able to grasp advanced statistical principles (at least the big ideas) readily in my modern-day workflow. He summarizes it in his brief and worthwhile talk below--only about 10 minutes, but BOOM, it is worth it. In a nutshell, when we are self-directed and learning or re-learning a skill or subject, we tend to focus on what we don't know before moving on.

In a traditional academic model, we group students together, usually by age, and around middle school, by age and perceived ability, and we shepherd them all together at the same pace. And what typically happens, let's say we're in a middle school pre-algebra class, and the current unit is on exponents, the teacher will give a lecture on exponents, then we'll go home, do some homework.

The next morning, we'll review the homework, then another lecture, homework, lecture, homework. That will continue for about two or three weeks, and then we get a test. On that test, maybe I get a 75 percent, maybe you get a 90 percent, maybe you get a 95 percent. And even though the test identified gaps in our knowledge, I didn't know 25 percent of the material. Even the A student, what was the five percent they didn't know?

I never thought about the challenge of recalling academic courses and specializations in advanced calculus, statistics, or programming languages. It was always about the test, and let's face it, if you got an 85% on a test you were chuffed. Or at least I was. But think about the 15% of the material you didn't know. What if your car's brakes were repaired only 85% of the way--would you consider them safe?

I thought I would share a few simple graphics I created to demonstrate what clarity and granularity can yield--what the small differences in a bar chart might mean when we look at the country as a whole.

The last graphic shows the electoral college votes--the only vote that matters in the US--but you can see how additional clarity can once again be more insightful than reliance on a simple statistic about winning an election. Thanks to Alberto Cairo for the image below.

*Reach out for any help with data literacy or creating your own data stories--either in your professional or personal life--brainstorming is always free! twitter.com/datamongerbonny*