The article is technical but can be parsed into tangible little morsels. Here are a few that interest me. Large-sample asymptotics is basically about how an estimate behaves as the sample grows toward a limit--you have probably heard of the Central Limit Theorem, and it plays a role in what Big Data can and cannot tell us. So that I don't go down the rabbit hole of technical jargon, here is a great analogy from the article (you should read it). Think of statistics as dealing with uncertainty--the author uses food metaphors, and they help.
Fast food will always exist because of the demand—how many of us have repeatedly had those quick bites that our doctors have told us to stay away from?
But this is the very reason that we need more people to work on understanding and warning about the ingredients that make fast food (methods) harmful; to study how to reduce the harm without unduly affecting their appeal; and to supply healthier and tastier meals (more principled and efficient methods) that are affordable (applicable) by the general public (users).
- The difference between the sample average and the population average is the product of three terms: (1) a data quality measure, (2) a data quantity measure, and (3) a problem difficulty measure (a small numerical sketch follows this list)
- The most critical—yet most challenging to assess—among the three is data quality.
- When combining data sources for population inferences, those relatively tiny but higher-quality ones should be given far more weight than suggested by their sizes.
- Data quality must be understood as a relative notion, and more precisely it should be termed data quality for a particular study. This is because any meaningful quantification of data quality must depend on (1) the purposes of the analysis—a dataset can be of very high quality for one purpose but useless for another; (2) the method of analysis (e.g., the choice of sample average instead of sample median); and (3) the actual data the analyst includes in the analysis
- Big Data Paradox--the bigger the data, the surer we fool ourselves.
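To make the first bullet concrete, here is a minimal simulation sketch--my own illustration, not code from the article--of the identity as I understand it: for a self-selected sample, the sample mean minus the population mean equals the correlation between responding and the outcome (data quality), times sqrt((N - n)/n) (data quantity), times the population standard deviation (problem difficulty). The response mechanism and all variable names below are hypothetical choices made just for the demo.

```python
# Minimal sketch of the error decomposition (illustrative; assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

N = 1_000_000                       # population size
Y = rng.normal(50, 10, size=N)      # outcome whose population average we want

# Hypothetical self-selection: people with larger Y are slightly more likely
# to end up in the "big" dataset, so the response indicator R is correlated with Y.
propensity = 0.5 + 0.02 * (Y - Y.mean()) / Y.std()
R = rng.random(N) < propensity

n = R.sum()                         # realized sample size (roughly N/2 here)
actual_error = Y[R].mean() - Y.mean()

# Decomposition: error = data quality x data quantity x problem difficulty
quality    = np.corrcoef(R.astype(float), Y)[0, 1]   # correlation between responding and Y
quantity   = np.sqrt((N - n) / n)
difficulty = Y.std()

print(f"actual error of the big-data average:  {actual_error:.4f}")
print(f"quality x quantity x difficulty:       {quality * quantity * difficulty:.4f}")
```

In this toy setup the self-selected sample covers roughly half the population, yet because the small correlation between responding and Y does not shrink as the dataset grows, collecting more data of the same kind does not remove the bias--which is the "bigger the data, the surer we fool ourselves" point above.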
Because each of us is unique, any attempt to “personalize” must be approximative in nature. But this is a very different kind of notion of approximation, compared to the traditional large-sample asymptotics, where the setup is to use a sample of individuals to learn about the population they belong to.
In contrast, individualized prediction is about finding a sequence of proxy populations with increased resolutions to learn about an individual. This leads to an ultimate challenge for Statistics (and statisticians): how to build a meaningful theoretical foundation for inference and prediction without any direct data?--Xiao-Li Meng, Whipple V.N. Jones Professor of Statistics, Harvard Faculty of Arts & Sciences; Editor in Chief of the Harvard Data Science Review
Finally, I now know why I am able to grasp advanced statistical principles (at least the big ideas) readily in my modern-day workflow. He summarizes it in his brief and worthwhile talk below--only about 10 minutes, but BOOM, it is worth it. In a nutshell, when we are self-directed and learning or re-learning a skill or subject, we tend to focus on what we don't know before moving on.
In a traditional academic model, we group students together, usually by age, and around middle school, by age and perceived ability, and we shepherd them all together at the same pace. And what typically happens is, let's say we're in a middle school pre-algebra class and the current unit is on exponents: the teacher will give a lecture on exponents, then we'll go home and do some homework.
The next morning, we'll review the homework, then another lecture, homework, lecture, homework. That will continue for about two or three weeks, and then we get a test. On that test, maybe I get a 75 percent, maybe you get a 90 percent, maybe you get a 95 percent. And even though the test identified gaps in our knowledge, I didn't know 25 percent of the material. Even the A student--what was the five percent they didn't know?