If you want data--we have data. The Johns Hopkins interactive resource provides the data that fuels many of the graphics exploding across the internet. I have shared my thoughts in panel discussions and here: Data after coronavirus...what survives?
In a reality that may never return, I presented topics in data literacy across a wide variety of industries, mostly using community- or population-level data. When you are live with full access to attendees, it is synergistic: you can offer clarifications, dive deep into topics that arise, or even question how data is sourced, prepared, analyzed, and communicated. More importantly--we can challenge how the data question was formulated. Often this requires modifying questions to better serve the data available.
In publications with limited space for back-story and education about the terms, models, or algorithms used, we can unintentionally mislead. In fact, my classes often begin with a viewing of art standing in for graphics, selected to reveal biases we may not even be aware of...
In the current environment, where we are isolated, bombarded with information, and perhaps fearful, a perfect storm for misinterpretation is brewing. I have noticed battles on Twitter between statisticians, epidemiologists, data scientists, and even economists: lots of grumbling about statistical models, weak assumptions, and who has the right to pontificate or offer expertise.
Let me offer my perspective, for what it may or may not be worth. I believe that statistics, epidemiological principles, data science, and economics are all tools and bodies of knowledge we need to understand as data professionals. Learning to read visualizations, and to create them, requires working knowledge at the edges of many different fields.
But here is the thing. I use many tools, Python for example, without a developer's depth of understanding. Perhaps this is easier than falling short in other skills, because it is hard to reach a wrong conclusion if you can't even get the code to run! I don't feel inferior, and neither should you, when using statistical tools or complex analytic algorithms. If they are to be applied to complex problems, we should be able to gain enough fluency to know what we know, and hope for collaboration and conversation when we wander off in the wrong direction.
I would like to roll this discussion back to a few foundational elements of the math and the assumptions that underlie much of the confusion, and to where I think we need the experts to clarify and engage in a narrative that elucidates rather than isolates.
The notion that we can manage without models and that sufficient quantities of data—big data—can take the place of models is a seductive one.--David Hand, What is the purpose of statistical modeling?
We can gather data of the scale visualized in the COVID-19 Dashboard, but do the numbers indeed speak for themselves? It is vital to recognize that there are different types of models. Only one of them, the "data-driven, empirical, or interpolatory" kind, can't be wrong in the usual sense: it simply summarizes the underlying data. Empirical models can, however, serve no purpose and have low value.
On the one hand we have theory-driven, theoretical, mechanistic, or iconic models, and on the other hand we have data-driven, empirical, or interpolatory models. Theory-driven models encapsulate some kind of understanding (theory, hypothesis, conjecture) about the mechanism underlying the data, such as Newton’s Laws of motion in mechanics, or prospect theory in psychology. In contrast, data-driven models merely seek to summarize or describe the data.--David Hand, What is the purpose of statistical modeling?
David Hand, Professor of Mathematics and Senior Research Investigator at Imperial College London and author of What is the Purpose of Statistical Modeling?, published in the Harvard Data Science Review, cautions that theory-driven models can indeed be wrong or misleading. Think about the scope of COVID-19 as visualized by confirmed cases, death tallies, and hospitalization rates: these measures may not represent the reality they are intended to represent.
For example, if you are not familiar with data visualization or statistics and simply view the COVID-19 projections published by the Institute for Health Metrics and Evaluation, you may not realize that the light purple shading represents the uncertainty around the projected measures. I rely on these graphics to describe resource-allocation projections, but if all you do is glance quickly, missing that vital piece of information can change the game.
The Financial Times has been my favorite resource. Thankfully, they are offering free access to their COVID-19 stories. I read my allotment of complimentary stories, but the $60/month fee for full access is a little steep. I like the readability and the well-annotated graphics.
The more you read and look at data visualizations, the more there is to learn. Pre-attentive attributes guide our attention, but they aren't reliable for determining what information might be missing.
We should consider Breiman's definition of the analyst's goal: "to extract information about how nature is associating the response variables to the input variables."
You might detect the illusion of prediction and information when we are missing many of the input variables needed to describe the relative frequency of disease (COVID-19 testing across the population regardless of symptoms), the modes of transmission, and estimates of the actual number of cases. These limitations will impact our ability to plan interventions and allocate health resources.
For prediction, data-driven models are ideal–indeed in some sense optimal. Given the model form (e.g. a linear relationship between variables) or a criterion to be optimized (e.g. a sum of squared errors), they can give the best fitting model of this form, and if the criterion is related to predictive accuracy, the result is necessarily good within the model family. In contrast, theory-driven models are required for understanding, although of course they can also be used for prediction.--David Hand, What is the purpose of statistical modeling?
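Hand's point about criterion-based fitting can be made concrete with a tiny sketch. The numbers below are made up for illustration; the point is only that, given a model form (a line) and a criterion (sum of squared errors), the fit is mechanically "best" within that family, whatever the data mean:

```python
# Least-squares line fit: the data-driven model Hand describes.
# The data points are hypothetical, not real case counts.
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])  # roughly y = 2x + 1, with noise

# polyfit with degree 1 minimizes the sum of squared errors.
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 2), round(intercept, 2))  # → 1.97 1.06
```

The fit is optimal under the squared-error criterion, but nothing in the procedure tells you whether a line was the right model form in the first place; that is the understanding a theory-driven model would supply.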
I was recently discussing the limitations and potential harms of reporting readily available statistics on rapidly accumulating data. The Tableau dashboard, although well-intentioned, might tempt many to generate graphics of limited value. In times like these, I would definitely continue to explore and learn from the data through these free platforms, but I would caution the data family to yield the outcomes and insights to the professionals. Here is a summary of insights gleaned from a recent article in The Guardian, Coronavirus statistics: what can we trust and what should we ignore?
I would be cautious of data reporting a daily count of confirmed cases or new deaths.
We are not testing the entire population, and determinations of eligibility for testing are widely heterogeneous. Consider counties where you must be admitted to a hospital or exhibiting profound symptoms to be tested, vs. exposure to a positive case, vs. testing the whole population.
If the sickest among us are the ones being tested--would you be surprised to see increasing death rates? What about deaths not attributed to confirmed cases but likely due to COVID-19? Are we testing the dead? What if the death occurs before the test results have been returned? How are co-morbidities being attributed on death certificates?
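The arithmetic behind that first question is worth seeing once. With entirely hypothetical numbers: if only the sickest fraction of infections ever get tested, the deaths still land in the numerator while the denominator shrinks, so the observed case fatality rate can dwarf the true infection fatality rate:

```python
# Hypothetical numbers: ascertainment bias when only the sickest are tested.
true_infections = 10_000
deaths = 50              # true infection fatality rate = 0.5%
confirmed_cases = 500    # only the sickest ~5% of infections get a test

# Deaths overwhelmingly occur among the tested (hospitalized), so all 50
# appear in the numerator over a much smaller denominator.
true_ifr = deaths / true_infections
observed_cfr = deaths / confirmed_cases
print(f"true IFR {true_ifr:.1%}, observed CFR {observed_cfr:.1%}")
# → true IFR 0.5%, observed CFR 10.0%
```

Nothing about the disease changed between the two numbers; only who was counted did.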
What about false negatives? We have an expanded pool of professionals applying swabs to nasal passages and throats. How are individuals who previously tested negative but are now positive counted? How will home testing impact the sensitivity and specificity of testing?
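Sensitivity and specificity interact with prevalence through Bayes' rule, and the effect is easy to underestimate. A minimal sketch with hypothetical test characteristics (these are not measured properties of any real COVID-19 test):

```python
# Positive predictive value: P(infected | positive test), via Bayes' rule.
# All numbers below are illustrative assumptions, not real test data.
def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# At 2% prevalence, dropping specificity from 99% to 95% (plausible for a
# less controlled home-testing setting) floods results with false alarms.
print(round(positive_predictive_value(0.95, 0.99, 0.02), 2))  # → 0.66
print(round(positive_predictive_value(0.95, 0.95, 0.02), 2))  # → 0.28
```

In other words, with the same swab and the same virus, a modest loss of specificity can mean that most positives in a low-prevalence population are wrong.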
What methods are used to smooth the data so we can capture trends?
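One common answer to that question is a trailing 7-day moving average, which damps the weekday/weekend reporting cycle at the cost of lagging behind turning points. A minimal sketch over made-up daily counts:

```python
# Trailing moving average over hypothetical daily case counts.
# Early entries average over fewer days until the window fills.
def moving_average(counts, window=7):
    smoothed = []
    for i in range(len(counts)):
        chunk = counts[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

daily = [10, 40, 35, 50, 20, 5, 60, 55, 45, 70]  # made-up counts
print([round(v, 1) for v in moving_average(daily)])
```

Whatever the chosen method, the caution stands: a smoothed curve is itself a modeling choice, and the window length should be reported alongside the trend.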
Logarithmic scales allow comparisons between populations. Many have opinions on this, but I think a log scale reflects the exponential growth of the virus. Yes, without the s-curve you might miss the overall magnitude of the problem, but when R-naught is driving the spread of disease, I think a log scale--as long as it is clearly defined--is helpful and relevant.
Data models can be useful, but the media rarely report the limitations of the chosen models or highlight the uncertainty.
The science behind antibody testing is beyond this discussion, but I suggest you listen to this quick tutorial by Peter Attia, MD. His podcast is one of only a few resources I read or listen to regularly about COVID-19.
Most of the big, attention-grabbing illustrations of data science in action are data-driven. But if theory-driven models can be wrong, data-driven models can be fragile. By definition they are based on relationships observed within the data which are currently available, and if those data have been chosen by some unrepresentative process, or if they were collected from a non-stationary world, then their predictions or actions based on the models may go awry.--David Hand, What is the purpose of statistical modeling?