There is a commercial that says something like, “Savoring the moments that were always there.” It reads like a silver lining to self-isolation during a global pandemic. The thing is, many of us have been working remotely for a long time. Now the secret is out. Depending on your sensibilities, this has been either an “aha!” moment or a glimpse into a reality that isn’t for you.
I lean toward the savoring side. In fact, one of the reasons I consider myself “unemployable” is that I would never consider traveling to an office--unless it is down a set of stairs, through the foyer and dining room, and into a quiet small office. Actual traditional employment was off the table before Covid-19. Now it is banned from the table entirely and, I would argue, not even allowed in the house.
The technology industry often laments slow adoption and implementation but carefully avoids the responsibility people have shirked by still going about business like a pack of Luddites. If you don’t believe me, go apply for a job. Take a rich history of successful collaborations and outcomes and cram it into the equivalent of a chiseled stone plaque. The portals for submitting CVs are outdated, inefficient, and ask you to replicate information or experience already outlined elsewhere...an especially fun task if you work as a data scientist or analyst.
Job descriptions for technical professionals are often compared to finding a mystical unicorn. I am not sure who is responsible for writing the job description fodder but I want whatever they are imbibing that stimulates the delusion. Many requirements demand more years of expertise with a platform or software than the platform or software has even existed.
I receive dozens of messages from recruiters and HR “professionals” offering me wonderful opportunities specific to my skills. Except they have no idea what my skills might be. It would take them 5 seconds to find out that I run my own consultancy in data analytics--not likely I am going to chuck it all for a 9 to 5. But they persist. Think “I’m mad as hell” from Network.
On the other hand, if I were a recruiter or similar professional, why not use LinkedIn like the resource it could be? Read what folks are posting, look for the diamond in the rough, and start identifying prospective employees with harpoons instead of wide nets.
The scene from Network reminded me of David Mamet because I confused it with Glengarry Glen Ross, which he actually did write. Part of my savoring what has always been was to actually watch MasterClass. I bought my husband a class a few years ago and managed to parlay that into a yearly subscription at a reduced rate.
It is, moreover, evident from what has been said, that it is not the function of the poet to relate what has happened, but what may happen- what is possible according to the law of probability or necessity. The poet and the historian differ not by writing in verse or in prose. The work of Herodotus might be put into verse, and it would still be a species of history, with meter no less than without it. The true difference is that one relates what has happened, the other what may happen. Poetry, therefore, is a more philosophical and a higher thing than history: for poetry tends to express the universal, history the particular.--Aristotle
My decision to no longer write manuscripts for publication weighed heavily on my mind for several months. This audio of David Mamet was like a nice tidy bow. I recently broke my vow to remove myself from the dubious role of medical writer. In the era of the Covid-19 pandemic, public speaking engagements dwindled or died on the vine. I said yes to work that should have been a hard no. I haven’t been a full-time medical writer in over a decade. Historically, a writer would pull together resources and summarize the existing data. Next, we helped develop the research question, but not so specifically as to leave opposing data out of the conversation. We developed an annotated outline, where the actual collaborations kicked into high gear. Typically these conversations were so informed and nuanced that they were the meat on the bones of a strong outline. The authors' voices were the point of the manuscript--my role was simply to create a unified voice and narrative for submission.
Well, that was then, this is now. Now the ring leader is the client of the client. Companies spring up for hire to write whatever it is you envision--long before a patient has enrolled in a clinical trial--and often way upstream from FDA approval. The goal is for them to please their client, not inform at the point of care to improve cost, quality of life, or outcomes. Profit rules not people. I get it. I am and was well-paid to do this. Ridiculously compensated for an even more ridiculous task. And we all pretend we are doing good.
We are not.
Show up, shut up, do your job. That is all they want. It's secretarial. It's marketing.
My normal routine has been disrupted. I am betting you can relate. What remains though is vital and can illuminate the foundational elements that can moor us and keep us sane--or at least keep the crazy bits to a minimum.
Most days a group of National Press Club members gather for 30 minutes on Zoom. The Journalism Institute hosts our discussions and presents a quick writing prompt to focus our discussion. Often timely or pulled from the community at large we share thoughts, laughs, outside resources, and a nontrivial amount of camaraderie and support.
One of my contributions was a glimpse into how I am managing to conduct workshops, speak at virtual events, and keep a bold working life in the face of grounded flights, cancelled venues, and unmitigated chaos. I thought I would share some of the tools and resources that have been monumental in this shift. Most, if not all, internet resources have a free option for exploring before purchasing. And the majority of these suggestions are reasonably priced. I think about it like wearing Prada boots with a pair of Target jeans. Make the necessary investments when and if you are able.
HyperDrive hub for managing additional ports
I purchased portable USB LED Video lights on adjustable tripod stands. You don't need to get them all at once but a pack of two was less than $60. You might have seen the loop/circular variety but I think they would get crushed during travel unless you are hyper vigilant. You will be surprised about how much better you look with proper lighting.
You can see the set-up in the corner of my office. It has the yellow filter (it comes with white also) and I have it adjusted so it is right behind the webcam tripod when I am standing at my desk. You can turn on your camera and then play with the fully adjustable lighting to see where you look less like a zombie and more like your best self. I also have another one to my right that either sits on the desk or on the floor with the tripod fully extended.
On top of my laptop you can see the webcam. It quickly clips onto the laptop or slides into the top of the tripod immediately behind it in the photo.
The last few things are subscription based tools I discovered along the way. A few I am trying out to see if they are useful enough to warrant a yearly subscription. Simply agree to a monthly trial for now--your mileage may vary.
For example, Noun Project offers icons and other useful graphics for building data visualizations, online courses, reports, and any customizable deliverable where you want to add a little polish.
You are going to thank me for this one. I have hours and hours of conference sessions recorded--some where I am speaking but others on interesting topics in unique venues or with notable experts. I meant to transcribe them or repurpose those Zoom talks I have given but who has time for that. Although this would be the perfect task for an assistant, I don't have one at the moment.
Try Descript. I primarily use it for simultaneously editing video and audio but think about creating a podcast or editing interviews.
My experience with Descript introduced me to Loom. Loom is an asynchronous tool that allows me to teach data analytics or survey design, for example, by screencasting. A simple workflow might look like this: record sessions on topics, edit them in Descript, and upload to my Teachable course. Boom.
In all honesty I have limited experience with Canva but I am also exploring perhaps integrating it into my workflow. I am starting to gravitate away from outside platforms and hoping to rely on my blog to share, message, and link to opportunities for engagement. I don't think I would need both Noun Project and Canva but I am exploring.
This has been a quick review of my foundation or roots in this time of upheaval and uncertainty. I hope buried here is an insight or suggestion to make your road a little smoother.
There are a lot of ways to support the blog if you found something monumental or time-saving. Share with a few friends, respond with a few of your own favorite tools, donate, or connect over on Twitter.
I have been working on a course on Teachable that isn't ready for primetime just yet but newsletter subscribers and sustaining donors will get links to courses for free. I will keep you in the loop.
Not sure if this escaped your interest, but we have had our own modern-day water pump in theories about the widespread transmission of COVID-19 in New York. The part of the water pump is being played by subway turnstiles, and it is a fascinating read if nothing else. The following is a working paper that reads like a conversation--one of the reasons I enjoyed reading and interpreting the data. Full disclosure: please be aware that this National Bureau of Economic Research paper is not peer-reviewed and is circulated for discussion and comments only.
THE SUBWAYS SEEDED THE MASSIVE CORONAVIRUS EPIDEMIC IN NEW YORK CITY
New York City’s multitentacled subway system was a major disseminator – if not the principal transmission vehicle – of coronavirus infection during the initial takeoff of the massive epidemic that became evident throughout the city during March 2020. The near shutoff of subway ridership in Manhattan – down by over 90 percent at the end of March – correlates strongly with the substantial increase in the doubling time of new cases in this borough. Maps of subway station turnstile entries, superimposed upon zip code-level maps of reported coronavirus incidence, are strongly consistent with subway-facilitated disease propagation. Local train lines appear to have a higher propensity to transmit infection than express lines. Reciprocal seeding of infection appears to be the best explanation for the emergence of a single hotspot in Midtown West in Manhattan. Bus hubs may have served as secondary transmission routes out to the periphery of the city.
For each station, the idea is first to compute the time trends in turnstile entries and coronavirus incidence, and then to assess whether there is a relation between the two trends across different subway stations (Fredriksson and Oliviera 2019). Unfortunately, there is a serious problem with this extraordinarily popular method of doing policy analysis (Bertrand, Duflo, and Mullainathan 2004). In particular, there is likely to be significant serial correlation in the outcomes among adjacent subway stations situated along the same line.
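To make the critiqued station-level approach concrete, here is a minimal sketch: estimate a per-station trend for turnstile entries and for case incidence, then correlate the two trends across stations. Everything here is simulated and the station names are made up; note that nothing in this calculation accounts for the serial correlation among stations on the same line, which is exactly the paper's objection.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
stations = [f"station_{i}" for i in range(10)]

# Simulated weekly turnstile entries (falling) and case incidence (rising)
entries = pd.DataFrame(
    {s: 10000 * np.exp(-0.5 * np.arange(6)) * rng.uniform(0.8, 1.2, 6) for s in stations}
)
cases = pd.DataFrame(
    {s: 5 * np.exp(0.4 * np.arange(6)) * rng.uniform(0.8, 1.2, 6) for s in stations}
)

# Per-station trend: slope of a least-squares line over the 6 weeks
def slope(series):
    t = np.arange(len(series))
    return np.polyfit(t, series, 1)[0]

entry_trends = entries.apply(slope)
case_trends = cases.apply(slope)

# Cross-station correlation of the two trends -- the step that ignores
# serial correlation among stations on the same line
r = np.corrcoef(entry_trends, case_trends)[0, 1]
print(round(r, 2))
```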
Following the realization that individual subway stations may not be the appropriate unit of analysis, the discussion reveals the utility of considering subway lines. I will summarize the static model of epidemic propagation discussed in more detail in the paper, but basically, susceptible individuals are classified as S and infectious individuals as I.
Incidence of new infection depends on the frequency of contact between S and I and the probability that there is transmission of infection.
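That incidence relation is the classic mass-action term from compartmental epidemic models. A minimal discrete-time SIR step looks like the sketch below; the parameter values (beta bundling contact frequency and per-contact transmission probability, gamma as the recovery rate) are illustrative assumptions, not values from the paper.

```python
# Minimal discrete-time SIR step: new infections depend on the frequency
# of S-I contact (S*I/N) and the transmission probability bundled into beta.
# All parameter values are illustrative assumptions.
def sir_step(S, I, R, beta=0.4, gamma=0.1, N=1_000_000):
    new_infections = beta * S * I / N   # frequency-dependent S-I contact
    new_recoveries = gamma * I
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

S, I, R = 999_990.0, 10.0, 0.0
for _ in range(60):  # simulate 60 days
    S, I, R = sir_step(S, I, R)
print(round(I))
```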
The Goscé model offers a number of insights that are immediately applicable to the data from the New York City Flushing subway line. The first is that the rate of disease transmission is related to the number of trips and average number of stations per trip along the entire subway line, and not just to the number of entries at any one subway station. Second, passengers entering the subway line even at a remote, less populous station are slowing down the system, thus increasing the transit time that the S’s stay in contact with the I’s. Third, those uninfected S- passengers who cram shoulder-to-shoulder into a particular subway are increasing train-car density and thus raising the average number of other S-passengers infected by an I-passenger who happens to be standing in the middle of the train. Fourth, local trains – like the Flushing local – are more likely to seed epidemic infections than express lines. Finally, an entire subway line, rather than the individual stations or subway cars, is the appropriate unit of analysis.
An important consideration is that reducing train service likely accelerated the spread of the virus, as commuters found themselves crammed into fewer cars for longer periods of time.
One distinguishing factor between the present study and prior work is that seasonal influenza has generally had a reproductive number R in the range of 1.2–1.4, while pandemic influenza has had an R in the range of 1.4–1.8, with the high end representing the 1918 pandemic (Biggerstaff et al. 2014). By contrast, we have estimated the R in New York City during the initial surge of infections in early March to be on the order of 3.4 (Harris 2020). An overall assessment of these research efforts may lead some scientific reviewers to conclude that cause-and-effect remains difficult to prove. Still, we doubt whether any public health practitioner would be reluctant to take action on the basis of the facts we now know.
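For intuition about how a doubling time maps to a reproductive number, here is a back-of-envelope sketch assuming simple exponential growth and a fixed generation interval. Both numbers below are assumptions for illustration; this is not the paper's estimation method.

```python
import math

# Growth rate from doubling time, then the Lotka-Euler relation for a
# fixed generation interval T: R = e^(r*T). Inputs are assumed values.
def growth_rate(doubling_time_days):
    return math.log(2) / doubling_time_days

def reproduction_number(doubling_time_days, generation_interval_days):
    r = growth_rate(doubling_time_days)
    return math.exp(r * generation_interval_days)

R = reproduction_number(doubling_time_days=3.0, generation_interval_days=5.0)
print(round(R, 2))  # 3.17 under these assumptions
```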
Harris, J. E. 2020. The Coronavirus Epidemic Curve Is Already Flattening in New York City. National Bureau of Economic Research Working Paper No. 26917, April 3, 2020. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3563985
If you want data--we have data. The Johns Hopkins Interactive Resource provides the data that fuels many of the graphics exploding across the internet. I have shared thoughts in panel discussions and here, Data after coronavirus...what survives?
In a reality that may never return, I present topics in data literacy across a wide variety of industries, but mostly community or population level data. When you are live with full access to attendees it is synergistic to be able to offer clarifications, deep dives into topics that arise, or even question how data is sourced, prepared, analyzed, and communicated. More importantly--we can challenge how the data question was formulated. Often, this requires modifying questions to better serve the data available.
In publications with limited space for back-story and education regarding terms, models, or algorithms used we can unintentionally mislead. In fact, the format of my classes often begins with the viewing of art standing in for graphics, selected to reveal biases we may not even be aware of...
In the current environment, where we are isolated, bombarded with information, and perhaps fearful, this is evolving into a perfect storm for misinterpretation. I have noticed battles on Twitter between statisticians, epidemiologists, data scientists, and even economists. Lots of grumbling about statistical models, weak assumptions, and who bears the right to pontificate or offer expertise.
Let me tell you my perspective for what it may or may not be worth. I believe that statistics, epidemiological principles, data science, and economics are all tools and information we need to understand as data professionals. Learning to read a visualization and to create them requires a deep understanding of a lot of edges from different industries.
But here is also the thing. I use many tools, like python for example, without the understanding of a developer. Perhaps this is easier than falling short in other skills because it is hard to reach a wrong conclusion if you can't even get the code to run! I don't feel inferior, and neither should you, when using statistical tools or complex analytic algorithms. If they are to be applied to complex problems we should be able to gain a workable amount of fluency to know what we know and hope for collaboration and conversation when we are wandering off in the wrong direction.
I would like to roll back this discussion to a few foundational elements of the maths and assumptions that underlie much of the confusion. And where I think we need the experts to clarify and engage in a narrative that elucidates not isolates.
The notion that we can manage without models and that sufficient quantities of data—big data—can take the place of models is a seductive one.--What is the Purpose of Statistical Modeling
We can gather data of the scale visualized in the COVID-19 Dashboard, but do the numbers indeed speak for themselves? It is vital to recognize that there are indeed different types of models. Only one of them, the "data-driven, empirical, or interpolatory" kind, can't be wrong; it simply summarizes the underlying data. Empirical models can, however, serve no purpose and have low value.
On the one hand we have theory-driven, theoretical, mechanistic, or iconic models, and on the other hand we have data-driven, empirical, or interpolatory models. Theory-driven models encapsulate some kind of understanding (theory, hypothesis, conjecture) about the mechanism underlying the data, such as Newton’s Laws of motion in mechanics, or prospect theory in psychology. In contrast, data-driven models merely seek to summarize or describe the data.--What is the purpose of statistical modeling?
David Hand, Professor of Mathematics and Senior Research Investigator at Imperial College London and author of What is the Purpose of Statistical Modeling, published in the Harvard Data Science Review, cautions that theory-driven models can indeed be wrong or misleading. Think about the scope of COVID-19 visualized by confirmed cases, death tallies, and hospitalization rates perhaps not representing the actual reality they are intended to represent.
For example, if you are not familiar with data visualization or statistics and simply view the COVID-19 projections published by the Institute for Health Metrics and Evaluation, you may not realize that the light purple shading represents the uncertainty around the measures. I rely on these graphics to describe resource allocation projections, but missing this vital piece of information on a quick glance can change the game.
The Financial Times has been my favorite resource. They are offering free access to COVID-19 stories (thankfully). I read my allotment of complimentary stories but the $60/month fee for full access is a little steep. I like the readability and well-annotated graphics.
The more you read and look at data visualizations the more there is to learn. Pre-attentive attributes guide our attention but aren't reliable for determining what information might indeed be missing.
We should consider the Breiman definition of information, "to extract information about how nature is associating the response variables to the input variables."
You might detect the illusion of prediction and information when we are missing many of the input variables to describe relative frequencies of disease (COVID-19 testing across the population regardless of symptoms), modes of transmission, estimates of actual number of cases. These limitations will indeed impact our ability to plan interventions and allocate health resources.
For prediction, data-driven models are ideal–indeed in some sense optimal. Given the model form (e.g. a linear relationship between variables) or a criterion to be optimized (e.g. a sum of squared errors), they can give the best fitting model of this form, and if the criterion is related to predictive accuracy, the result is necessarily good within the model family. In contrast, theory-driven models are required for understanding, although of course they can also be used for prediction.--David Hand, What is the purpose of statistical modeling
I was recently discussing limitations and potential harms of using readily available statistics reported with the rapidly accumulating data. The Tableau Dashboard, although well-intentioned, might tempt many to generate graphics of limited value. In times like these, I would definitely continue to explore and learn about the data through these free platforms, but I would caution the data family to yield the outcomes and insights to professionals. Here is a summary of insights gleaned from a recent article in The Guardian, Coronavirus statistics: what can we trust and what should we ignore?
I would be cautious of data reporting a daily count of confirmed cases or new deaths.
We are not testing the entire population. Determinations of eligibility for testing are widely heterogeneous. Consider counties where you must be admitted to a hospital or exhibiting profound symptoms to be tested, versus exposure to a confirmed case, versus testing the whole population.
If the sickest of all of us are being tested--would you be surprised to see increasing death rates? What about deaths not attributed to confirmed cases but likely due to COVID-19? Are we testing the dead? What if the death occurs before the test results have been returned? How are co-morbidities being attributed on death certificates?
What about false negatives? We have an expanded pool of professionals applying swabs to nasal passages and throats--how are individuals previously tested as negative but now positive counted? How will home-testing impact the sensitivity and specificity of testing?
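Test accuracy questions like these come down to how sensitivity, specificity, and prevalence combine via Bayes' rule. A minimal sketch of positive predictive value, with illustrative (not published) test characteristics, shows why the same test means much less in a low-prevalence population:

```python
# PPV via Bayes' rule: the probability that a positive result is a true
# positive. All numbers below are illustrative assumptions.
def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test, two populations: PPV collapses when prevalence is low
high = positive_predictive_value(0.95, 0.95, 0.20)
low = positive_predictive_value(0.95, 0.95, 0.01)
print(round(high, 2), round(low, 2))  # 0.83 0.16
```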
What methods are used to smooth the data so we can capture trends?
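One common answer, sketched with simulated counts rather than real surveillance data: a centered 7-day rolling average, which irons out day-to-day reporting noise so the underlying trend is visible.

```python
import numpy as np
import pandas as pd

# Simulated daily counts: exponential trend with day-to-day reporting noise
rng = np.random.default_rng(1)
days = np.arange(30)
trend = 100 * np.exp(0.15 * days)
noisy = trend * rng.uniform(0.7, 1.3, size=30)

daily = pd.Series(noisy, index=days)
smoothed = daily.rolling(window=7, center=True).mean()

# The first and last 3 days lack a full 7-day window, so they come back NaN
print(smoothed.notna().sum())  # 24
```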
Logarithmic scales allow comparisons between populations. Many have opinions on this, but I think a log scale reflects exponential viral growth. Yes, you might miss the overall magnitude of the problem without the S-curve, but when R-naught is driving the spread of disease, a log scale--as long as it is clearly defined--is helpful and relevant.
Data models can be useful but the media rarely provides the limitations of the chosen models or highlights the uncertainty.
The science behind antibody testing is beyond this discussion but I suggest you listen to this quick tutorial by Peter Attia MD. His podcast is one of only a few resources I read or listen to regularly about COVID-19.
Most of the big, attention-grabbing illustrations of data science in action are data-driven. But if theory-driven models can be wrong, data-driven models can be fragile. By definition they are based on relationships observed within the data which are currently available, and if those data have been chosen by some unrepresentative process, or if they were collected from a non-stationary world, then their predictions or actions based on the models may go awry.--David Hand, What is the purpose of statistical modeling?
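Hand's fragility point can be made concrete in a few lines: fit a purely data-driven polynomial to early data that happen to grow exponentially, and the extrapolation goes awry even though the in-sample fit looks fine. All numbers here are simulated.

```python
import numpy as np

# Simulated early-epidemic data: exponential growth over 10 days
days = np.arange(10)
cases = 100 * np.exp(0.2 * days)

# A purely data-driven quadratic fit -- no theory about the mechanism
coeffs = np.polyfit(days, cases, deg=2)
in_sample = np.polyval(coeffs, days)
out_sample = np.polyval(coeffs, 20)   # extrapolate to day 20

truth_day20 = 100 * np.exp(0.2 * 20)
print(out_sample < truth_day20)  # True: the quadratic undershoots the exponential
```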
Apparently I am an extroverted introvert. When in social situations I am calm, friendly, and have been accused of being mildly entertaining. I thrive both literally and figuratively on public speaking and being at the podium. The problem is--I prefer to stay home or with small groups of friends.
I tell you this to give you an idea of how I have been tempering the changes of the last month or so. Delays in projects, rescheduling of talks, and serious doubts about the status of conference appearances for the remainder of the year notwithstanding--I remain pretty good. I have long been a creature of habit and, more importantly, a remote worker. My last W-2 gigs (many years ago) were also jobs where I worked from a remote office and traveled to client locations or to the office on a quasi-quarterly basis.
Many of us have been watching the pandemic and either relying on data visualizations or recreating our own from raw data. The problem is--there are many missteps and fumbles around what the data is actually capable of contributing to the narrative.
If you are an epidemiologist or have studied epidemiology for public health, many of the miscommunications are quite obvious. I think we could all use a better foundation in data literacy and fluency, and what better place to start than with a map from the Johns Hopkins Coronavirus Resource Center. The data is available for download on GitHub, where you will find instructions and guidance. I recommend you read the resources providing information on the terms used to describe the pandemic and important guidelines regarding epidemiology.
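If you pull the time-series files from GitHub, note that they arrive in wide format, with one column per date. A common first step is reshaping to long format; the sketch below mimics the layout with a tiny made-up frame (placeholder counts, and the real files also carry Lat/Long columns) rather than downloading anything.

```python
import pandas as pd

# Tiny frame mimicking the wide layout: one column per date, placeholder counts
wide = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["US", "Italy"],
    "3/1/20": [30, 40],
    "3/2/20": [53, 70],
})

# Reshape to long format: one row per (place, date) pair
long = wide.melt(
    id_vars=["Province/State", "Country/Region"],
    var_name="date",
    value_name="confirmed",
)
long["date"] = pd.to_datetime(long["date"])
print(len(long))  # 2 places x 2 dates = 4 rows
```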
When I view the map, the red creates an ominous and deadly vibe. Yes, people are dying, but perhaps we need context to understand--fear mongering will only get us so far. I barely noticed the green font depicting the number of people recovered. And since the red dots indicate confirmed cases, the reality is likely much worse. Confirmed only means a case was validated with a test--a test with its own biases and limitations. And we know that in the US, at least, we are limited in testing or even providing tests to populations of people in our communities.
Context is king when working with large complex datasets.
There are important considerations that need to accompany any visualization but COVID-19 data has a time horizon that is critical to clarify. For example, when were national measures enacted like shelter in place, or self-isolation (shown here by star symbols). What happens if we are only measuring confirmed cases in areas where tests are known to be largely unavailable or limited?
I personally prefer the selection of a logarithmic scale on the y-axis to better convey exponential viral growth. There is a lot to discuss in this graphic. Did you notice that the US does not have a marker indicating a national message to shelter in place? If we observe countries that have issued national orders--how long before the bend in the curve is evident?
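Here is why the log scale earns its keep: the slope of log cases over time is the growth rate, and ln(2) divided by that slope is the doubling time, so bends in the curve become directly readable. A sketch with simulated data and an assumed 10% daily growth rate:

```python
import numpy as np

# Simulated case series growing 10% per day
days = np.arange(21)
cases = 50 * np.exp(0.10 * days)

# On a log scale, exponential growth is a straight line; its slope is the
# growth rate, and ln(2)/slope is the doubling time
slope = np.polyfit(days, np.log(cases), 1)[0]
doubling_time = np.log(2) / slope
print(round(doubling_time, 1))  # 6.9 days
```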
Here are a few resources to help you make better visualizations...
There is a weird facet of my personality that applauds irony in all its iterations. I was asked to speak at a local community event for "innovators and entrepreneurs" highlighting the United Nations Sustainable Goals--specifically equality. My suggestion to introduce the utility of census data and how to access, clean, and analyze for free was welcomed.
Unfortunately, in the absence of effective marketing--the draw of census data is not exactly standing room only. Because as it turned out--they must have all been standing somewhere else. The attendees mill about drinking free beer and nibbling on heavy hors d'oeuvres, and once the second round of talks begins--they are typically engaged in other conversations. Not to worry, I persist.
Back to the census data. To explore broad questions beyond GDP--the overworked metric of gross domestic product--it is vital to dig deeper. And better yet, the insights are free once you can tackle the steep learning curve. What an opportunity to meet your potential clients, patients, or customers in the communities where they live. Identify the barriers to improved outcomes by identifying structural determinants and working for policies to ameliorate wide disparities in not only income but opportunity.
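For those curious where to begin on that learning curve: the Census Bureau exposes American Community Survey estimates through a simple web API. The endpoint pattern below is real (B01003_001E is total population); the particular query, county populations in California, is just an example. This sketch only builds the request URL, since the actual call needs network access and, for heavier use, a free API key.

```python
from urllib.parse import urlencode

# ACS 5-year estimates endpoint; the query below is an example, not a recipe
base = "https://api.census.gov/data/2019/acs/acs5"
params = {
    "get": "NAME,B01003_001E",   # place name + total population estimate
    "for": "county:*",           # every county...
    "in": "state:06",            # ...in California (FIPS code 06)
}
url = f"{base}?{urlencode(params)}"
print(url)
```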
If we believe that we, as Americans, are bound together by a common concern for each other, then an urgent national priority is upon us. We must begin to end the disgrace of this other America.
If you said I was stubborn you wouldn't be a liar. I refused to acknowledge the signs of an oncoming cold. Figuring I could run it out I did an easy 10 mile run hoping my oxygenated lungs would expunge the irritants and I would be back to being shiny and new. Let's just say all didn't go as planned and I spent the afternoon sipping tea spiked with a bit of whiskey and watching a few documentaries.
Signs of Humanity was a brilliant surprise. Willie Baronet is an artist and professor in Texas. Well, after watching his documentary I can say with confidence--he is also a filmmaker. His story illustrates the humanity and compassion evident in his interviews of over 100 homeless people. Offering to purchase their signs, he collects them and creates art installations to bring awareness and conversation to the front line of our debates on community and policy change.
I have always known that poverty isn't simply one thing. It is a cascade of small and large tragedies that can leave us hopeless, bereft, and completely alone.
Watch Signs of Humanity. If you have a Prime account it is free.
As an analyst, I can only measure what I bring to the discussion. If you write about poverty, social determinants of health, or other variables with an easy numeric tally or comparator you are leaving data on the table. The tensions we hold can help inform and elevate discussions.
There is no "other" in discussions of poverty. In an economy where we must keep our fingers crossed that we don't lose our jobs (and the benefits they provide), become ill, or need to reduce work load to care for ailing parents--there is no floor. You can drop right down to the bottom at the blink of an eye.
Over the years we have been lucky. My husband and I had our parents during the fragile years of building our own little family. There were so many random challenges that didn't seem to care if we were highly educated and well compensated. His boss shot himself in the heart and we were left without a steady income that had seemed Teflon over the prior 17 years. I once worked in Pharma, and as companies were sold, merged, and scrapped--I began to appreciate the fragility of long-term security.
I work in healthcare for the human side of medicine--not the profit motives winding through our fragile US health system. I want us to ask better questions and to do a better job at questioning answers. We need to pay attention, become data literate, and share our stories.
Follow along with me as we explore census data, government data, and other resources to help add a human dimension to a much needed narrative...
Think about the human genome, with its 3 billion base pairs: even in a very large randomised trial, say of 20 000 people, thousands of genetic differences are sure to arise between the groups. Some of these differences—which we do not yet know— might be important for prognosis. Randomisation guarantees that such differences are indeed due to chance. It means that statistical theory based on random sampling can be used to calculate confidence intervals that express the potential magnitude of such chance events...Vandenbroucke 2004 When are observational studies as credible as randomized trials?
I am not exactly sure when it is appropriate to affix the badge of data scientist or analyst to your lapel and boldly saunter out into the world. The battles are softly waged with words as statisticians claim certain inalienable rights as loose definitions of data literacy are assigned to the spoils. I am only responsible for my little corner of the world.
Through several post-graduate degrees I learned statistics and most recently completed an executive education program from the Fu School of Engineering at Columbia in Applied Analytics. The curriculum was quite a slog as mastery in Python came at the expense of real-time skills applicable to daily questions. I completed the course with a score of 97% but felt a little overwhelmed. I had questions. Questions that needed answers and answers that screamed for clarity. Slowly I began tackling Python library by library. I started with Pandas and began to gain proficiency outside of the need to pass an exam or submit a capstone.
So here we are. A data curious human navigates the world of health economics, health policy, and clinical medicine. Think about this. If we aren't intended to be able to clearly understand the research or media discussions in this space--who is the audience? My goal isn't to provide unyielding answers to these perturbations but to point and say, "What is that?" and "Why does this matter?"
When guiding or facilitating discussions around data literacy or fluency, I pull clinical research into the discussion and collaboratively step through the numbers to help inform colleagues about best practices, or even self-governance, in relying on data to curate insights. For example, we may understand that randomized controlled trials are the "gold standard" for answering questions about tolerability, efficacy, and safety, but is this always the case?
I am not debating the role of a well-designed study powered to answer a question, but we need to understand the role of observational studies and the potential limits of RCTs. The majority of RCTs are limited by size and follow-up periods. If we are attempting to say anything meaningful about duration of response, safety, or adverse events, we need to look at case-control or large-scale observational studies.
To evaluate causal mechanisms of disease for example, we need to be able to navigate observational studies. How can we identify or adjust for confounding? Where do we begin while acknowledging that many confounders are insufficiently known or are unquantifiable?
What if we can create a visual representation of causal assumptions to identify potential pathways of confounding?
Directed acyclic graphs, or DAGs, connect variables with arrows. Each arrow is directed, pointing from a cause to its effect, and "acyclic" means no sequence of arrows ever leads back to the variable where it started. The arrows represent direct causal effects: if we remove an arrow, the effect is no longer present. And because causes (such as exposures) precede effects (disease), we follow a sort of chronology when creating a DAG.
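To make this concrete, here is a minimal sketch of the CKD example from the figure below, with the graph stored as a plain dictionary of arrows and an acyclicity check via Kahn's topological-sort algorithm. The variable names (age, CKD, mortality) come from the figure; everything else is illustrative scaffolding, not a standard causal-inference library.

```python
from collections import deque

# The DAG from the CKD example: age causes both CKD and mortality,
# and CKD causes mortality. Each key maps a variable to the
# variables its arrows point at.
dag = {
    "age": ["CKD", "mortality"],
    "CKD": ["mortality"],
    "mortality": [],
}

def is_acyclic(graph):
    """Kahn's algorithm: a directed graph is acyclic iff every node
    can be peeled off in an order that respects the arrows."""
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for t in graph[node]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    # If a cycle exists, the nodes on it never reach indegree 0.
    return visited == len(graph)

print(is_acyclic(dag))  # True: no sequence of arrows circles back
```

The same check would flag a graph with a feedback loop (say, A → B and B → A) as not a valid DAG, which is exactly the "acyclic" constraint that lets us read the graph chronologically.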
Our a priori knowledge informs how we design the question.
Figure from Suttorp et al., "Graphical presentation of confounding in directed acyclic graphs" (Nephrology Dialysis Transplantation, Volume 30, Issue 9, September 2015, Pages 1418–1423)
A graphical presentation of confounding in DAGs. (a) The structure of confounding in DAGs. Since age is a common cause of CKD and mortality, confounding is present when we want to assess the causal relationship between the exposure CKD and the outcome mortality (b). The backdoor path from CKD via age to mortality can be blocked by conditioning on age, as depicted by a box around age in (c). Similarly, ethnicity is a common cause of obesity and decline in kidney function (d). The backdoor path from obesity via ethnicity to decline in kidney function can be blocked by conditioning on ethnicity. If ethnicity is not measured or not properly measured, residual confounding remains present.
As described by Suttorp and colleagues, the three criteria of confounding are:
1. The confounder must be associated with the outcome
2. The confounder must be associated with the exposure
3. The confounder must not lie on the causal path from exposure to outcome (i.e. it must not be a consequence of the exposure)
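The three criteria can be sketched as checks against the DAG itself. This is a simplification I am making for illustration, not the paper's method: I use "has a directed path to" as a stand-in for "is associated with," which ignores associations that flow through non-causal paths, but it captures the spirit of the checklist.

```python
def descendants(graph, start):
    """All variables reachable by following arrows from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for t in graph.get(node, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def is_confounder(graph, candidate, exposure, outcome):
    # 1. Associated with the outcome (proxy: causal path to it).
    affects_outcome = outcome in descendants(graph, candidate)
    # 2. Associated with the exposure (same proxy).
    affects_exposure = exposure in descendants(graph, candidate)
    # 3. Not a consequence of the exposure, i.e. not downstream
    #    of it on the causal path to the outcome.
    not_consequence = candidate not in descendants(graph, exposure)
    return affects_outcome and affects_exposure and not_consequence

# The CKD example again: age points at both CKD and mortality.
dag = {"age": ["CKD", "mortality"], "CKD": ["mortality"], "mortality": []}
print(is_confounder(dag, "age", "CKD", "mortality"))  # True
```

Criterion 3 is what separates a confounder from a mediator: a variable sitting *between* exposure and outcome fails the check, because adjusting for it would block part of the very effect we are trying to estimate.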
Confounding distorts the actual effect, so we need to remove as much of its impact as possible. The methods vary, but it is critical to know whether and how the confounding was identified, and which methods were used to address it. To address confounding by age when evaluating the relationship between chronic kidney disease (CKD) and mortality, we might, for example, stratify by age.
Conditioning is the umbrella term for adjusting for confounding; it includes restriction, stratification, and multivariable analysis. The box around age in the figure above indicates that this confounder has been blocked.
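Here is a toy stratification, using counts I invented purely for illustration (they are not from any study). If older patients are both more likely to have CKD and more likely to die, the crude comparison exaggerates the CKD–mortality association; computing the effect within each age stratum and averaging removes much of that distortion.

```python
def risk_difference(a_events, a_total, b_events, b_total):
    """Risk in group A minus risk in group B."""
    return a_events / a_total - b_events / b_total

# Invented counts per age stratum:
# (deaths with CKD, total with CKD, deaths without CKD, total without CKD)
strata = {
    "under_60": (5, 100, 5, 400),
    "60_plus": (60, 200, 30, 150),
}

# Crude (unstratified) comparison pools everyone together.
ce = sum(s[0] for s in strata.values())  # deaths, CKD
te = sum(s[1] for s in strata.values())  # total, CKD
cu = sum(s[2] for s in strata.values())  # deaths, no CKD
tu = sum(s[3] for s in strata.values())  # total, no CKD
crude = risk_difference(ce, te, cu, tu)

# Stratified comparison: effect within each age group, then a
# weighted average by stratum size.
total_n = sum(s[1] + s[3] for s in strata.values())
adjusted = sum(
    (s[1] + s[3]) / total_n * risk_difference(*s)
    for s in strata.values()
)

print(f"crude risk difference:    {crude:.3f}")     # 0.153
print(f"adjusted risk difference: {adjusted:.3f}")  # 0.063
```

With these numbers the crude risk difference (about 0.15) is larger than the effect within *either* age stratum, which is the signature of confounding by age; the stratified estimate (about 0.06) is what blocking the backdoor path through age buys us. In practice one would use a proper estimator (Mantel–Haenszel, regression) rather than this hand-rolled weighted average.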
What about the causal relationship between obesity and decline in kidney function described in (d) above? Prior research ("Differences in progression to ESRD between black and white patients receiving predialysis care in a universal health care system") describes a faster decline in kidney function and progression to end-stage renal disease (ESRD) in black patients, and obesity rates are higher among African Americans. If we integrate that prior work, we might define ethnicity as a confounder. (The research article conflates race and ethnicity--race is a social construct being misused as a biologic proxy, but for purposes of this discussion I am not poking that tiger. And again, are we talking ethnicity or race?)
...It is, however, possible to identify confounding in a DAG that is impossible to adjust for. For instance, it could be that physicians did not record ethnicity, and ethnicity is thus unavailable in the data analyses. The investigator cannot adjust for a factor that is not measured. Similarly, it is possible that adjustments are only partly successful in controlling for confounding. For example, even if ethnicity was recorded and adjusted for in the analyses, some residual confounding can remain present.
More to come (stay tuned...)