A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.--TechTarget
A quick digression, but one of my favorite Seth Godin quotes goes something like this: "Now that you have your ducks in a row, what are you going to do with the duck?"--Whatcha Gonna Do with That Duck?
This is relevant because a few recent projects had a duck of their own to imagine. You might say the duck is at the heart of data governance: you now have a data lake--now what? The surprise for me is the sheer volume of analyses where, once the insight is gleaned or the report written, the duck ends up as nothing more than a spreadsheet on someone's desktop.
Although we tend to credit or blame things on a single major cause, in nature and in science there are almost always multiple factors that have to be exactly right for an event to take place. For example, we might attribute a forest fire to the carelessly thrown cigarette butt, but what about the grassy tract leading to the forest, the dryness of the vegetation, the direction of the wind and so on? All of these factors had to be exactly right for the fire to start. Even though many tossed cigarette butts don’t start fires, we zero in on human actions as causes, ignoring other possibilities, such as sparks from branches rubbing together or lightning strikes, or acts of omission, such as failing to trim the grassy path short of the forest. And we tend to focus on things that can be manipulated: We overlook the direction of the wind because it is not something we can control. Our scientifically incomplete intuitive model of causality is nevertheless very useful in practice, and helps us execute remedial actions when causes are clearly defined.
...it need not be the case that A causes B if A remains correlated with B when A is produced by an act that is free in this sense, since it still remains possible that the free act that produces A also causes B via a route that does not go through A. As an illustration, consider a case in which an experimenter’s administration of a drug to a treatment group (by inducing patients to ingest it) has a placebo effect that enhances recovery, even though the drug itself has no effect on recovery. There is a correlation between ingestion of the drug and recovery that persists under the experimenter’s free act of administering the drug even though ingestion of the drug does not cause recovery.
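The structure of that example is easy to see in a toy simulation (all rates here are my own assumed numbers, purely for illustration): administration fully determines ingestion and also triggers a placebo effect, while ingestion itself does nothing, yet ingestion and recovery remain correlated.

```python
import random

random.seed(0)

BASE_RATE = 0.3      # recovery probability with no intervention (assumed)
PLACEBO_BOOST = 0.3  # extra recovery probability from the placebo effect (assumed)

ingested, recovered = [], []
for _ in range(10_000):
    administered = random.random() < 0.5  # the experimenter's "free act"
    ingests = administered                # ingestion is fully determined by administration
    # The placebo effect travels with administration, not through any
    # pharmacological action: the drug itself contributes nothing.
    p_recover = BASE_RATE + (PLACEBO_BOOST if administered else 0.0)
    ingested.append(1 if ingests else 0)
    recovered.append(1 if random.random() < p_recover else 0)

def recovery_rate(flag):
    """Recovery rate among subjects whose ingestion status equals flag."""
    group = [r for i, r in zip(ingested, recovered) if i == flag]
    return sum(group) / len(group)

# Ingesters recover noticeably more often than non-ingesters, even though
# ingestion has no causal effect on recovery in this simulation.
print(recovery_rate(1) - recovery_rate(0))
```

The gap between the two recovery rates is the persistent correlation the passage describes; the causal route runs through administration and expectation, not through the drug.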
There are many of us elbow-deep in data but lacking Vinay's powerful platform to denounce the naked emperor in a way that creates attention--and often tension. Daily, I review pre-clinical or phase II trial data showing minimal, if any, efficacy that happened to squeak past a statistical test with no clear clinical benefit, or marginal benefit at best--all based on a weak surrogate outcome.
A recent article, "Low-value approvals and high prices might incentivize ineffective drug development," in Nature Reviews Clinical Oncology (2018), raises important questions for consideration--especially if you are in the business of establishing efficacy and safety, or of writing about research findings typically spun with the most bombastic of claims, certain to impact humanity and save us all.
Consider the example of the >1,000 clinical trials involving immune-checkpoint inhibitors. In many cases, the biological rationale for these studies is either limited or absent. In some scenarios, immune-checkpoint inhibitors are being tested in combination with other agents in the absence of single-agent activity, even though such activity is generally considered a promising prerequisite for the inclusion of anticancer drugs in combination therapies.
To make this calculation, first we note that accepting a single trial with a P-value < 0.05 as the threshold of significance means that, if one ran 100 trials for which the null hypothesis were true (that the drug is ineffective), on average, 5 trials would produce false-positive ‘statistically significant’ results.
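The arithmetic in that passage is quick to reproduce. A minimal sketch, taking the trial count and threshold from the quote (the independence assumption behind the second figure is mine, not the article's):

```python
# False-positive arithmetic from the passage above: 100 trials, each
# tested at alpha = 0.05, with the null hypothesis true in every one.
alpha = 0.05
n_trials = 100

# Expected number of false-positive "statistically significant" results.
expected_false_positives = alpha * n_trials
print(expected_false_positives)  # 5.0

# Probability that at least one of the 100 null trials clears the
# threshold purely by chance, assuming the trials are independent.
p_at_least_one = 1 - (1 - alpha) ** n_trials
print(round(p_at_least_one, 3))  # ~0.994
```

With >1,000 such trials underway, a chance "win" somewhere is all but guaranteed--which is exactly the incentive problem the article worries about.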
Ignorant as science may still be about certain happenings in yeast, it’s dwarfed by our ignorance of what is going on in our own cells. Part of what makes a project like this one at the University of Toronto possible is that yeast has been heavily studied and its genes intricately annotated by several generations of biologists, to a degree not yet reached with the human genome, which is comparatively enormous, rambling and full of mysteries. Still, the researchers say that they hope that as gene-editing technology for human cells advances, these kinds of experiments can help reveal more about the workings of cells and how the genes within a genome relate to one another. “I think there are many basic rules of genome biology we have not discovered.”--How Many Genes Do Cells Need? Maybe Almost All of Them
The theme of this post highlights many things, but the take-home message for me is to be cautious of heuristics and simplified algorithms. Although potentially useful, they run the risk of oversimplifying complex interactions--and yes, obfuscating a few forests and trees along the way...