I don’t. Next question? I am only partially joking. The most common output format for large non-proprietary datasets (at least in healthcare) seems to be CSV. Occasionally I can grab a SAS file, but I think spreadsheets are here to stay. A CSV file has all of the formatting and formulas stripped out, so although these files are still cumbersome--they work.
This data is from the Household Pulse 2020 COVID household survey from the Census. You can readily see that the ability to gather any information about the shape of this data is limited.
Writing a few lines of Python code can provide information about the shape of the data and the variables included, although unless you are familiar with the data, you will also need to download the data dictionary. This particular survey contains 82 columns and 132,961 entries, or rows.
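A minimal pandas sketch of that first look. The column names below are a tiny illustrative stand-in, not the real file; in practice you would point `pd.read_csv` at the CSV you downloaded from the Census Bureau and check the names against the data dictionary.

```python
import io
import pandas as pd

# Tiny stand-in for the real Household Pulse CSV download --
# these column names are an illustrative subset only
csv_text = """SCRAM,EST_ST,TBIRTH_YEAR,EGENDER
V100000001,1,1985,1
V100000002,2,1972,2
V100000003,4,1990,1
"""

# In practice: df = pd.read_csv("your_downloaded_pulse_file.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (rows, columns) -- the full survey is (132961, 82)
print(df.columns.tolist())  # variable names to look up in the data dictionary
df.info()                   # dtypes and non-null counts per column
```

Three lines--`shape`, `columns`, `info()`--give you most of what the raw CSV hides.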
You can also explore data on the Census website and use their interactive tool. I usually start here and formulate data questions as I go. Reach out with any questions. The newly launched newsletter will be designed to include links for deeper-dive tutorials or a focused narrative for less tech-oriented subscribers. You can subscribe here. Because I am switching my existing list of subscribers from the old format over to the new format--anyone subscribing to the new format before the end of September will continue to have access for free.
One thing many of us working in statistics and data literacy can agree on is the broken pedagogy and misalignment between maths and the existing teaching curriculum.
Now, because of COVID-19 we are taking that broken foundational model and moving it to remote learning--what could go wrong?
When I teach underlying mathematical principles in a statistics or data science course, I leap-frog over the memorization and boring bits and move right to the application. Perhaps not ideal, but if the goal is to teach a team how to reach the part of the workflow where they can begin to curate insights from their data--a few corners are going to need to be cut.
Here is the rub though. They often learn more in the over-simplification because they never knew what they were doing down in the weeds anyway. For example, when you are data modeling--what is the shape of your data? We talk about linear, sinusoidal, or quadratic relationships. I write about it briefly in this blog--Maths in the real world.
We have all heard the lamenting about taking calculus. “When am I ever going to use it?” Did you know derivatives can tell you a lot of information in the real world? How about whenever you think about the rate of change of a function--most recently, for example, while calculating the COVID-19 rate of positive tests? Also when we think of population growth in biology or marginal functions in economics.
I like to introduce the brilliance of maths so we can stand back and marvel or appreciate it. Recently, a post, On apple trees and man, described Benford’s Law. Discovering the not-so-random nature of big data provides a glimpse of the complexity but also the mystery of math--a look beyond the rote-memorization introduction that led many of us to avoid math simply out of principle.
The quote below is from an informative discussion about online-instruction and how we need to Teach Better.
Anyway, the key thing there is that the relevance has to be there for people to engage, and we also have to think about how do you kind of shape knowledge in the discipline? You know, how does a novice look at things? And chemistry is a great example because when you're a chemist, you get good at dealing with symbols.
I think the problem with symbols, and not knowing the storytelling behind their shorthand, stops so many of us in our tracks. If you are integrating classroom response systems or “clickers,” where you can respond to student gaps and questions in real time, you can avoid the tendency to gloss over esoteric terms and abbreviations and mistakenly assume that all students are joining you on the journey.
Online workshops and webinars have taught me that we can’t do any of it in a meaningful way without engagement. Here is an article, The Classroom Observation Protocol for Undergraduate STEM (COPUS): A New Instrument to Characterize University STEM Classroom Practices. I use it as a model for teaching technical topics remotely. I hope you will steal these ideas to make your work more engaging.
Here is the podcast episode where they provide a bit more context to the work being done in STEM, specifically in chemistry, but you can easily connect the ideas to how we are teaching statistics, for example.
It is remarkable how closely the history of the apple tree is connected with that of man. --Henry David Thoreau
I’m not judging but I am not typically a binge-watcher of TV. A few notable exceptions would be Better Things (I watch it on a loop) and a new Netflix series, Connected. Latif Nasser is a science journalist with a likable foppish personality that intentionally or unintentionally hides a complex and thinking human.
Okay, maybe “hide” is the wrong word. He is definitely packaging knowledge by distracting us from the "veggies in the sauce". You aren’t aware of how important and technical these topics are because they are seasoned with a bit of graphic artistry and film noir. All of the episodes will draw you in. The third episode, “Dust,” explains how the archaeological remains in a dried lake in the Sahara desert replenish phosphorus washed away by the rains in the Amazon basin--and other fun facts I had no idea about. These dust storms are visible from space and influence weather systems as well as our health and wellness.
Connected: Digits (episode 4 in series)
The connection running throughout the series is attributed to the “Hidden Science of Everything”. If you work in science or with data you likely are familiar. We know that skills in data science or research findings, for example, are not homogenized and isolated bits of information. But too often we create silos of knowledge anyway. Instead of thinking cinematically we think linearly. Learn this skill. Now this one. Okay, here is another. A piecemeal attempt to understand the chaos and intersectionality of everything. I am a big advocate of pushing around the edges of seemingly disparate ideas until we detect a slight alignment.
The episode about digits introduces us to Benford’s Law. Back in the day before calculators, books of logarithms were published. Observation of a wide variety of data sets yielded something interesting. The random numbers were not random after all. Their distribution was following an unknown pattern. Unknown--but quietly present in all of the data. Impossible to not see once you become aware of its presence.
You can read more about the history of Benford’s law over at The Conversation. Or explore by visiting the page below (simply click on the image). There is a wide variety of datasets available for you to apply the law and see what happens.
You can dig deeply over on Wikipedia as well. Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law.
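If you want to try the first-digit law on a dataset yourself, a few lines of Python will do it. This is a generic sketch, not tied to any particular dataset from the page above; powers of 2 stand in for real data because they famously follow Benford's Law.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. 0.05 -> '5.000000e-02'
    return int(f"{abs(x):e}"[0])

def benford_expected(d):
    """Benford's Law: P(first digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

# Powers of 2 are a classic sequence that obeys Benford's Law;
# swap in any real dataset of positive numbers here
data = [2 ** n for n in range(1, 500)]
counts = Counter(leading_digit(x) for x in data)
total = sum(counts.values())

for d in range(1, 10):
    print(f"digit {d}: observed {counts[d] / total:.3f}, "
          f"Benford {benford_expected(d):.3f}")
```

Run it on a genuinely random sequence (say, uniform random numbers) and the observed column flattens out--the pattern only appears in the kinds of naturally occurring data the law describes.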
Thinking outside of our specific box not only broadens our awareness but allows us to see the vast number of “boxes” on the horizon.
There are certain tasks that I have been doing a certain way forever and ever. I did not realize how complex I was making some of these workflows until I was asked to teach a class over on Teachable. Faced with creating a recipe, I noticed that perhaps--in some cases--the juice was no longer worth the squeeze. Too many steps to explain to a heterogeneous audience. It is one thing to slog through content if you know everyone has been baselined and we are starting from the same spot.
It is quite another thing to know that some will be yawning while others are likely to be gnashing their teeth in frustration. I won’t pretend to know the right balance but here is what I know.
The best approach is to have mini conversations. These are general to be sure but allow exposure to the vastness of a complicated subject. I realize that it is your filter for information that matters--not mine. I imagine you will learn the way I did. Collecting little bits of information here and there, retaining the glittery bits to “feather your nest” as it were.
I had this epiphany while working with Census data. Like many of you, I have been working on a workflow ahead of the 2020 Census. Unless you are teaching, you tend to fly through certain steps and only realize your error once you are reviewing the webinar or recording. Over on Teachable the learning curve was steep but not insurmountable.
One day I might wear makeup or add style but for now it is all about content.
This week, we will build a map. Stay tuned...
There is a commercial that says something like, “Savoring the moments that were always there.” It reads like a silver lining to self-isolation during a global pandemic. The trouble with many of us--we have been working remotely for a long time. Now the secret is out. Depending on your sensibilities this has been perhaps an “aha!” moment or a glimpse into a reality that isn’t for you.
I lean toward the savoring side. In fact, one of the reasons I consider myself “unemployable” is I would never consider traveling to an office. Unless it is down a set of stairs, through the foyer and dining room and into a quiet small office. Actual traditional employment was off the table before Covid-19. Now it is quite banned from the table and I would argue not even allowed in the house.
Technologists often lament slow adoption and implementation but carefully avoid the responsibility people have shirked by still going about business like a pack of Luddites. If you don’t believe me, go apply for a job. Take a rich history of successful collaborations and outcomes and cram it into the equivalent of a chiseled stone plaque. The portals for submitting CVs are outdated, inefficient, and ask you to replicate information or experience already outlined elsewhere...an especially fun task if you work as a data scientist or analyst.
Job descriptions for technical professionals are often compared to finding a mystical unicorn. I am not sure who is responsible for writing the job description fodder, but I want whatever they are imbibing that stimulates the delusion. Many postings require more years of expertise with a platform or software than the platform or software has even existed.
I receive dozens of messages from recruiters and HR “professionals” offering me wonderful opportunities specific to my skills. Except they have no idea what my skills might be. It would take them 5 seconds to find out that I run my own consultancy in data analytics--not likely I am going to chuck it all for a 9 to 5. But they persist. Think I’m Mad as Hell from Network.
On the other hand, if I was a recruiter or similar professional, why not use LinkedIn like the resource it could be? Read what folks are posting, look for the diamond in the rough and start identifying prospective employees with harpoons instead of wide nets.
The scene from Network reminded me of David Mamet because I confused it with Glengarry Glen Ross, which he actually did write. Part of my savoring what has always been was to actually watch MasterClass. I bought my husband a class a few years ago and managed to parlay that into a yearly subscription at a reduced rate.
It is, moreover, evident from what has been said, that it is not the function of the poet to relate what has happened, but what may happen- what is possible according to the law of probability or necessity. The poet and the historian differ not by writing in verse or in prose. The work of Herodotus might be put into verse, and it would still be a species of history, with meter no less than without it. The true difference is that one relates what has happened, the other what may happen. Poetry, therefore, is a more philosophical and a higher thing than history: for poetry tends to express the universal, history the particular.--Aristotle
My decision to no longer write manuscripts for publication weighed heavily on my mind for several months. This audio of David Mamet was like a nice tidy bow. I recently broke my vow to remove myself from the dubious role of medical writer. In the era of the Covid-19 pandemic, public speaking engagements dwindled or died on the vine. I said yes to work that should have been a hard no. I haven’t been a full-time medical writer in over a decade. Historically, a writer would pull together resources and summarize the existing data. Next, we helped develop the research question, but not so specifically as to leave opposing data out of the conversation. We developed an annotated outline where the actual collaborations kicked off in high gear. Typically these conversations were so informed and nuanced that they were the meat on the bones of a strong outline. The authors' voices were the point of the manuscript--my role was simply to create a unified voice and narrative for submission.
Well, that was then, this is now. Now the ring leader is the client of the client. Companies spring up for hire to write whatever it is you envision--long before a patient has enrolled in a clinical trial--and often way upstream from FDA approval. The goal is for them to please their client, not inform at the point of care to improve cost, quality of life, or outcomes. Profit rules not people. I get it. I am and was well-paid to do this. Ridiculously compensated for an even more ridiculous task. And we all pretend we are doing good.
We are not.
Show up, shut up, do your job. That is all they want. It's secretarial. It's marketing.
My normal routine has been disrupted. I am betting you can relate. What remains though is vital and can illuminate the foundational elements that can moor us and keep us sane--or at least keep the crazy bits to a minimum.
Most days a group of National Press Club members gather for 30 minutes on Zoom. The Journalism Institute hosts our discussions and presents a quick writing prompt to focus our discussion. Often timely or pulled from the community at large we share thoughts, laughs, outside resources, and a nontrivial amount of camaraderie and support.
One of my contributions was a glimpse into how I am managing to conduct workshops, speak at virtual events, and keep a bold working life in the face of grounded flights, cancelled venues, and unmitigated chaos. I thought I would share some of the tools and resources that have been monumental in this shift. Most, if not all, internet resources have a free option for exploring before purchasing. And the majority of these suggestions are reasonably priced. I think about it like wearing Prada boots with a pair of Target jeans. Make the necessary investments when and if you are able.
HyperDrive hub for managing additional ports
I purchased portable USB LED Video lights on adjustable tripod stands. You don't need to get them all at once but a pack of two was less than $60. You might have seen the loop/circular variety but I think they would get crushed during travel unless you are hyper vigilant. You will be surprised about how much better you look with proper lighting.
You can see the set-up in the corner of my office. It has the yellow filter (it comes with white also), and I have it adjusted so it is right behind the webcam tripod when I am standing at my desk. You can turn on your camera and then play with the lighting--it is fully adjustable--and see where you look less like a zombie and more like your best self. I also have another one to my right that either sits on the desk or on the floor with the tripod fully extended.
On top of my laptop you can see the webcam. It quickly clips onto the laptop or slides into the top of the tripod immediately behind it in the photo.
The last few things are subscription based tools I discovered along the way. A few I am trying out to see if they are useful enough to warrant a yearly subscription. Simply agree to a monthly trial for now--your mileage may vary.
For example, Noun Project offers icons and other useful graphics for building data visualizations, online courses, reports, and any customizable deliverable where you want to add a little polish.
You are going to thank me for this one. I have hours and hours of conference sessions recorded--some where I am speaking, but others on interesting topics in unique venues or with notable experts. I meant to transcribe them or repurpose those Zoom talks I have given, but who has time for that? Although this would be the perfect task for an assistant, I don't have one at the moment.
Try Descript. I primarily use it for simultaneously editing video and audio but think about creating a podcast or editing interviews.
My experience with Descript introduced me to Loom. Loom is an asynchronous tool that allows me to teach data analytics or survey design, for example, by screencasting. A simple workflow might be something like this: I record sessions on topics, then edit them in Descript, and upload to my Teachable course. Boom.
In all honesty I have limited experience with Canva, but I am exploring integrating it into my workflow. I am starting to gravitate away from outside platforms, hoping to rely on my blog to share, message, and link to opportunities for engagement. I don't think I would need both Noun Project and Canva, but I am exploring.
This has been a quick review of my foundation or roots in this time of upheaval and uncertainty. I hope buried here is an insight or suggestion to make your road a little smoother.
There are a lot of ways to support the blog if you found something monumental or time-saving. Share with a few friends, respond with a few of your own favorite tools, donate, or connect over on twitter.
I have been working on a course on Teachable that isn't ready for primetime just yet but newsletter subscribers and sustaining donors will get links to courses for free. I will keep you in the loop.
Not sure if this escaped your interest, but we have had our own modern-day water pump in theories about the widespread COVID-19 outbreak in New York. The part of the water pump is being played by subway turnstiles, and it is a fascinating read if nothing else. The following is a working paper that reads like a conversation--one of the reasons I enjoyed reading and interpreting the data. Full disclosure: please be aware that this National Bureau of Economic Research paper is not peer-reviewed and is circulated for discussion and comments only.
THE SUBWAYS SEEDED THE MASSIVE CORONAVIRUS EPIDEMIC IN NEW YORK CITY
New York City’s multitentacled subway system was a major disseminator – if not the principal transmission vehicle – of coronavirus infection during the initial takeoff of the massive epidemic that became evident throughout the city during March 2020. The near shutoff of subway ridership in Manhattan – down by over 90 percent at the end of March – correlates strongly with the substantial increase in the doubling time of new cases in this borough. Maps of subway station turnstile entries, superimposed upon zip code-level maps of reported coronavirus incidence, are strongly consistent with subway-facilitated disease propagation. Local train lines appear to have a higher propensity to transmit infection than express lines. Reciprocal seeding of infection appears to be the best explanation for the emergence of a single hotspot in Midtown West in Manhattan. Bus hubs may have served as secondary transmission routes out to the periphery of the city.
For each station, the idea is first to compute the time trends in turnstile entries and coronavirus incidence, and then to assess whether there is a relation between the two trends across different subway stations (Fredriksson and Oliviera 2019). Unfortunately, there is a serious problem with this extraordinarily popular method of doing policy analysis (Bertrand, Duflo, and Mullainathan 2004). In particular, there is likely to be significant serial correlation in the outcomes among adjacent subway stations situated along the same line.
Following the realization that individual subway stations may not be the appropriate unit of analysis, the discussion reveals the utility of considering subway lines. I will summarize the static model of epidemic propagation discussed in more detail in the paper, but basically, susceptible individuals are classified as S and infectious individuals as I.
Incidence of new infection depends on the frequency of contact between S and I and the probability that there is transmission of infection.
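In textbook notation that incidence term is often written as beta * S * I / N, with the contact rate and the per-contact transmission probability folded into beta. A toy sketch of the idea (the numbers are purely illustrative and not fitted to the paper's subway data):

```python
def new_infections(S, I, N, beta):
    """Frequency-dependent incidence: contacts between susceptibles (S)
    and infectious individuals (I) in a population of size N, scaled by
    beta, which folds together the contact rate and the per-contact
    transmission probability."""
    return beta * S * I / N

# Toy population, chosen for illustration only
S, I, N = 9_990, 10, 10_000
beta = 0.3  # effective contacts per day times transmission probability

print(new_infections(S, I, N, beta))  # expected new cases in one time step
```

Notice how the term scales with both S and I: more infectious riders on a crowded train, or more susceptible riders packed in with them, both push incidence up--which is exactly the intuition behind the subway-line insights that follow.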
The Goscé model offers a number of insights that are immediately applicable to the data from the New York City Flushing subway line. The first is that the rate of disease transmission is related to the number of trips and average number of stations per trip along the entire subway line, and not just to the number of entries at any one subway station. Second, passengers entering the subway line even at a remote, less populous station are slowing down the system, thus increasing the transit time that the S’s stay in contact with the I’s. Third, those uninfected S- passengers who cram shoulder-to-shoulder into a particular subway are increasing train-car density and thus raising the average number of other S-passengers infected by an I-passenger who happens to be standing in the middle of the train. Fourth, local trains – like the Flushing local – are more likely to seed epidemic infections than express lines. Finally, an entire subway line, rather than the individual stations or subway cars, is the appropriate unit of analysis.
An important consideration is that reducing train service likely accelerated the spread of the virus, as commuters found themselves crammed into fewer cars for longer periods of time.
One distinguishing factor between the present study and prior work is that seasonal influenza has generally had a reproductive number R in the range of 1.2–1.4, while pandemic influenza has had an R in the range of 1.4–1.8, with the high end representing the 1918 pandemic (Biggerstaff et al. 2014). By contrast, we have estimated the R in New York City during the initial surge of infections in early March to be on the order of 3.4 (Harris 2020). An overall assessment of these research efforts may lead some scientific reviewers to conclude that cause-and-effect remains difficult to prove. Still, we doubt whether any public health practitioner would be reluctant to take action on the basis of the facts we now know.
Harris, J. E. 2020. The Coronavirus Epidemic Curve Is Already Flattening in New York City.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3563985: National Bureau of Economic Research Working Paper No. 26917, April 3, 2020
If you want data--we have data. The Johns Hopkins Interactive Resource provides the data that fuels many of the graphics exploding across the internet. I have shared thoughts in panel discussions and here, Data after coronavirus...what survives?
In a reality that may never return, I present topics in data literacy across a wide-variety of industries but mostly community or population level data. When you are live with full access to attendees it is synergistic to be able to offer clarifications, deep dives into topics that arise, or even question how data is sourced, prepared, analyzed, and communicated. More importantly--we can challenge how the data question was formulated. Often, this requires modifying questions to better serve the data available.
In publications with limited space for back-story and education regarding terms, models, or algorithms used we can unintentionally mislead. In fact, the format of my classes often begins with the viewing of art standing in for graphics, selected to reveal biases we may not even be aware of...
In the current environment where we are isolated, bombarded with information, and perhaps fearful this is evolving into a perfect storm for misinterpretation. I have noticed battles on twitter between statisticians, epidemiologists, data scientists, and even economists. Lots of grumbling about statistical models, weak assumptions, and who bears the right to pontificate or offer expertise.
Let me tell you my perspective for what it may or may not be worth. I believe that statistics, epidemiological principles, data science, and economics are all tools and information we need to understand as data professionals. Learning to read a visualization and to create them requires a deep understanding of a lot of edges from different industries.
But here is also the thing. I use many tools, like python for example, without the understanding of a developer. Perhaps this is easier than falling short in other skills because it is hard to reach a wrong conclusion if you can't even get the code to run! I don't feel inferior, and neither should you, when using statistical tools or complex analytic algorithms. If they are to be applied to complex problems we should be able to gain a workable amount of fluency to know what we know and hope for collaboration and conversation when we are wandering off in the wrong direction.
I would like to roll back this discussion to a few foundational elements of the maths and assumptions that underlie much of the confusion. And where I think we need the experts to clarify and engage in a narrative that elucidates not isolates.
The notion that we can manage without models and that sufficient quantities of data—big data—can take the place of models is a seductive one.--What is the Purpose of Statistical Modeling
We can gather data of the scale visualized in the COVID-19 Dashboard, but do the numbers indeed speak for themselves? It is vital to recognize that there are different types of models. Only one of them, the "data-driven, empirical, or interpolatory" kind, can't be wrong: it is simply summarizing underlying data. Empirical models can, however, serve no purpose and have low value.
On the one hand we have theory-driven, theoretical, mechanistic, or iconic models, and on the other hand we have data-driven, empirical, or interpolatory models. Theory-driven models encapsulate some kind of understanding (theory, hypothesis, conjecture) about the mechanism underlying the data, such as Newton’s Laws of motion in mechanics, or prospect theory in psychology. In contrast, data-driven models merely seek to summarize or describe the data.--What is the purpose of statistical modeling?
David Hand, Professor of Mathematics and Senior Research Investigator at Imperial College London and author of What is the Purpose of Statistical Modeling, published in the Harvard Data Science Review, cautions that theory-driven models can indeed be wrong or misleading. Think about the scope of COVID-19 visualized by confirmed cases, death tallies, and hospitalization rates perhaps not representing the actual reality they are intended to represent.
For example, if you are not familiar with data visualization or statistics and simply view the COVID-19 projections as published by the Institute for Health Metrics and Evaluation, you may not realize that the light purple shading represents the uncertainty around the measures. I rely on these graphics to describe resource allocation projections, but missing that vital piece of information on a quick glance can change the game.
The Financial Times has been my favorite resource. They are offering free access to COVID-19 stories (thankfully). I read my allotment of complimentary stories but the $60/month fee for full access is a little steep. I like the readability and well-annotated graphics.
The more you read and look at data visualizations the more there is to learn. Pre-attentive attributes guide our attention but aren't reliable for determining what information might indeed be missing.
We should consider the Breiman definition of information, "to extract information about how nature is associating the response variables to the input variables."
You might detect the illusion of prediction and information when we are missing many of the input variables to describe relative frequencies of disease (COVID-19 testing across the population regardless of symptoms), modes of transmission, estimates of actual number of cases. These limitations will indeed impact our ability to plan interventions and allocate health resources.
For prediction, data-driven models are ideal–indeed in some sense optimal. Given the model form (e.g. a linear relationship between variables) or a criterion to be optimized (e.g. a sum of squared errors), they can give the best fitting model of this form, and if the criterion is related to predictive accuracy, the result is necessarily good within the model family. In contrast, theory-driven models are required for understanding, although of course they can also be used for prediction.--David Hand, What is the purpose of statistical modeling
I was recently discussing the limitations and potential harms of using readily available statistics reported with the rapidly accumulating data. The Tableau Dashboard, although well-intentioned, might tempt many to generate graphics of limited value. In times like these, I would definitely continue to explore and learn about the data through these free platforms, but I would caution the data family to yield the outcomes and insights to professionals. Here is a summary of insights gleaned from a recent article in The Guardian, Coronavirus statistics: what can we trust and what should we ignore?
I would be cautious of data reporting a daily count of confirmed cases or new deaths.
We are not testing the entire population. Determinations of eligibility for testing are widely heterogeneous. Consider counties where you have to be admitted to a hospital or exhibiting profound symptoms to be tested, vs. exposure to a positive case, vs. testing the whole population.
If the sickest of all of us are being tested--would you be surprised to see increasing death rates? What about deaths not attributed to confirmed cases but likely due to COVID-19? Are we testing the dead? What if the death occurs before the test results have been returned? How are co-morbidities being attributed on death certificates?
What about false negatives? We have an expanded pool of professionals applying swabs to nasal passages and throats--how are individuals previously tested as negative but now positive counted? How will home-testing impact the sensitivity and specificity of testing?
What methods are used to smooth the data so we can capture trends?
Logarithmic scales allow comparisons between populations--many have opinions on this, but I think it reflects the exponential viral growth. Yes, you might miss the overall magnitude of the problem without the s-curve, but when we have R-naught driving the spread of disease, I think a log scale--as long as it is clearly defined--is helpful and relevant.
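Both points are easy to demonstrate in a few lines of pandas: a 7-day rolling mean smooths day-of-week reporting artifacts, and a log10 transform turns steady exponential growth into a roughly straight line, which is what makes growth rates comparable across populations of different sizes. The daily counts below are made up for illustration, not real surveillance data.

```python
import math
import pandas as pd

# Made-up daily case counts (illustrative only, not real data)
daily = pd.Series([5, 12, 9, 20, 35, 30, 55, 80, 76, 120, 150, 140, 210, 260])

# A 7-day rolling average smooths day-of-week reporting artifacts
smoothed = daily.rolling(window=7).mean()

# On a log10 scale, steady exponential growth plots as a straight line,
# so the slope (the growth rate) can be compared across populations
log_cases = smoothed.dropna().apply(math.log10)

print(smoothed.round(1).tolist())
print(log_cases.round(2).tolist())
```

The rolling mean leaves the first six entries empty by design--there isn't a full week of data yet--which is itself a reminder that smoothed curves lag the raw counts.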
Data models can be useful but the media rarely provides the limitations of the chosen models or highlights the uncertainty.
The science behind antibody testing is beyond this discussion but I suggest you listen to this quick tutorial by Peter Attia MD. His podcast is one of only a few resources I read or listen to regularly about COVID-19.
Most of the big, attention-grabbing illustrations of data science in action are data-driven. But if theory-driven models can be wrong, data-driven models can be fragile. By definition they are based on relationships observed within the data which are currently available, and if those data have been chosen by some unrepresentative process, or if they were collected from a non-stationary world, then their predictions or actions based on the models may go awry.--David Hand, What is the purpose of statistical modeling?