Maybe I have completed one too many data projects recently, but a recent read of The Slippery Math of Causation made me chuckle. I wondered whether the data lake I was navigating might help douse the flames in this scenario, which artfully reminds us of the complexity of causation and correlation. And yes, dear reader, data lakes are a thing.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.--Tech Target
A quick digression, but one of my favorite Seth Godin quotes goes something like this, "Now that you have your ducks in a row, what are you going to do with the duck."--Watcha Gonna Do with That Duck?
This is relevant because a few recent projects had a duck to imagine. You might say the duck is at the heart of data governance. You now have a data lake--now what? The surprise for me is the sheer volume of analyses where, once the insight is gleaned or the report written, the duck is literally nothing more than a spreadsheet on someone's desktop.
Okay so you have glimpsed the series of synapses that led to the article above and subsequently this post. Listen to this great description of causality and correlation and why it is inherently problematic.
I read a ton of articles like the one cited in the quote above, titled Causation and Manipulability, from the Stanford Encyclopedia of Philosophy, primarily because without a glimpse of the debates around statistical theory and practice you tend to lose sight of the forest for the trees. Ideology is strong when you are in an academic center but, in my experience, things get a little squishy out in the real world.
...it need not be the case that A causes B if A remains correlated with B when A is produced by an act that is free in this sense, since it still remains possible that the free act that produces A also causes B via a route that does not go through A. As an illustration, consider a case in which an experimenter’s administration of a drug to a treatment group (by inducing patients to ingest it) has a placebo effect that enhances recovery, even though the drug itself has no effect on recovery. There is a correlation between ingestion of the drug and recovery that persists under the experimenter’s free act of administering the drug even though ingestion of the drug does not cause recovery.
Vinay Prasad, MD/MPH, Asst. Prof, Heme-Onc, EBM & Policy, carefully evaluates and challenges the status quo in the wake of a tsunami of clinical trial data, specifically in oncology. I became a fan when he adopted Twitter threads as a stream-of-thinking format to present his points and to clarify his position among his large base of Twitter followers.
There are many of us elbow deep in data but lacking the powerful platform of Vinay to renounce the naked emperor in a way that creates attention and often tension. Daily, I review pre-clinical or phase II trial data with minimal if any efficacy that happened to squeak past the statistical test with no clear clinical benefit or marginal at best--all based on a weak surrogate outcome.
A recent article, Low-value approvals and high prices might incentivize ineffective drug development, in Nature Reviews Clinical Oncology (2018), raises important questions for consideration--especially if you are in the business of establishing efficacy and safety. Or writing about research findings typically spun with the most bombastic of claims certain to impact humanity and save us all.
Consider the example of the >1,000 clinical trials involving immune-checkpoint
What if you run enough clinical trials and yield 5 false-positives out of 100? Can you make a hefty profit?
To make this calculation, first we note that accepting a single trial with a P-value < 0.05 as the threshold of significance means that, if one ran 100 trials for which the null hypothesis were true (that the drug is ineffective), on average, 5 trials would produce false-positive ‘statistically significant’ results.
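For anyone who wants to poke at the arithmetic in that passage, here is a minimal sketch. The 100 trials and the 0.05 threshold come from the quote; the function names are my own, and the "at least one false positive" figure rests on my added assumption that the trials are independent.

```python
def expected_false_positives(n_trials: int, alpha: float) -> float:
    """Expected count of 'significant' results when every null hypothesis is true."""
    return n_trials * alpha


def prob_at_least_one_false_positive(n_trials: int, alpha: float) -> float:
    """Chance that one or more trials cross the threshold by luck alone.

    Assumes the trials are independent (an assumption, not from the article).
    """
    return 1.0 - (1.0 - alpha) ** n_trials


n_trials = 100  # number of trials of an ineffective drug, from the quote
alpha = 0.05    # P-value threshold, from the quote

expected = expected_false_positives(n_trials, alpha)
p_any = prob_at_least_one_false_positive(n_trials, alpha)

print(f"Expected false positives: {expected:.1f}")
print(f"Probability of at least one: {p_any:.3f}")
```

Under those assumptions, a large enough portfolio of trials of ineffective drugs is all but guaranteed to produce at least one "win".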
To return to our probability scenario--In the ExteNET clinical trial, the 2-year invasive disease-free survival (surrogate end-point) rate was 93·9% (95% CI 92·4–95·2) in the neratinib group and 91·6% (90·0–93·0) in the placebo group. Perhaps instead of approving Neratinib we should find out what we are putting in that placebo...
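To put the two ExteNET rates side by side, here is a rough sketch of the arithmetic. Only the two percentages come from the trial figures quoted above; the variable names and the "number needed to treat" line (simply the reciprocal of the absolute difference) are my own additions, not something the trial report is quoted as stating here.

```python
# 2-year invasive disease-free survival rates quoted above (percent)
neratinib_rate = 93.9
placebo_rate = 91.6

# Absolute difference between the arms, in percentage points
absolute_difference = neratinib_rate - placebo_rate

# Rough number needed to treat: reciprocal of the absolute difference
# expressed as a proportion (my own illustrative addition)
number_needed_to_treat = 100.0 / absolute_difference

print(f"Absolute difference: {absolute_difference:.1f} percentage points")
print(f"Approximate number needed to treat: {number_needed_to_treat:.0f}")
```

This is a point-estimate sketch only; it ignores the confidence intervals reported alongside each rate.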
Systematic analysis of complex genetic interactions in yeast demonstrates a dependency on extensive networks that may play a key role in genotype-to-phenotype relationships, genome size, and speciation. An important reminder of how it is seldom
Ignorant as science may still be about certain happenings in yeast, it’s dwarfed by our ignorance of what is going on in our own cells. Part of what makes a project like this one at the University of Toronto possible is that yeast has been heavily studied and its genes intricately annotated by several generations of biologists, to a degree not yet reached with the human genome, which is comparatively enormous, rambling and full of mysteries. Still, the researchers say that they hope that as gene-editing technology for human cells advances, these kinds of experiments can help reveal more about the workings of cells and how the genes within a genome relate to one another. “I think there are many basic rules of genome biology we have not discovered,” --How Many Genes Do Cells Need? Maybe Almost All of Them
The theme of this post highlights many things, but the take-home message for me is to be cautious of heuristics and simplified algorithms. Although potentially useful, they run the risk of oversimplifying complex interactions--and yes, obfuscating a few forests and trees along the way...
In a world of "evidence-based" medicine I am a bigger fan of practice-based evidence.
Remember the quote by Upton Sinclair...
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”