"The real voyage of [data] discovery consists not in seeking new landscapes, but in having new eyes." -- Marcel Proust
The ability to grab nonproprietary data for question generation and exploration might be the most neglected resource. For example, continuing medical education companies create programs at an immense profit, often based on hunches, heuristics, or whatever the funder or sponsor deems worthy. What might the landscape or the outcomes look like if there were better survey analytics or relevant metrics to analyze?
I have tried. Most often I am told that the preferred mechanism or platform is to keep the status quo: useless pre- and post-evaluations that measure nothing of value, or multiple-choice formats that do not mirror how physicians take care of patients. Then there are the ever-present Likert-style questions that confuse unimodal and multimodal scales. I will say two things here, again. First, values on a Likert scale are not comparable across questions.
A score of 5 on one question is just that: a 5 on one question. Second, if you start using ranking scales you will improve your data quality and analyses considerably. Why?
How are we communicating about point of care interventions in the absence of decision choice experiments, determinations of value for each stakeholder, and collaborations with patients?
Think about probabilities. In medicine we measure probabilities. What is your risk for a disease? What are the chances you will respond to treatment? Think about large-scale risk/benefit questions. Today we aren't tackling surveys, though. You can read "Write Better Surveys. Period." for the price of a cup of fancy coffee, or stay tuned to future posts.
I wrote a quick post over on LinkedIn that generated a lot of interest and no small number of private messages and queries. As much as I would prefer to engage colleagues publicly to foster broad discussion, I get it: you are a shy bunch. I shared my method for jumping into a large dataset and looking around. Obviously, there is more work to be done, but once you can see which questions the data might be able to answer, you can ask better questions and begin questioning the data.
A big dataset with a lot of options for data curiosity is the ClinicalTrials.gov database of the US National Library of Medicine. A friend was curious about the reasons clinical trials are terminated, and he had some static spreadsheets with aggregated answers. I thought it might be interesting to dig a little deeper, so I gave him the interactive visualization above, which separates studies reporting "low accrual" from studies reporting reasons other than "low accrual" (available at Tableau Public), and this one as well, covering actively recruiting clinical trials for colorectal cancer.
I typically start on the website for downloading content for analysis: the Clinical Trials Transformation Initiative (CTTI)'s Database for Aggregate Analysis of ClinicalTrials.gov (AACT). I suggest reading the page to better understand your options for data, but if you are a beginner, scroll down to the link for the Clinical Trials project. This simplifies the download of aggregated data from the publicly available relational database.
Once you click on download, it will take you to the following options. If you want to run a local (static) copy of the AACT database, you can install a PostgreSQL server and voilà, there you go. I no longer use this option because from time to time the code breaks and does not accept my password. It creates a loop of resetting the password and waiting and waiting.
I started using pipe-delimited files and working in R. I will show you a brief line of code or two so you can see how easily this option might work for you. The advantage of pipe-delimited files? They look like this: column 1|column 2|column 3|column 4. Instead of being separated by a delimiter that might also appear inside a value, such as the comma in a csv file, the fields are separated by the pipe character.
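To make the delimiter point concrete, here is a minimal sketch. It is written in Python with only the standard library's csv module, which is an assumption for illustration, since the post itself works in R. The sample rows and column names are made up and are not taken from AACT.

```python
import csv
import io

# Hypothetical sample: the "reason" field in the first row contains a comma,
# which would need quoting in a csv file but is harmless with a pipe delimiter.
sample = (
    "id|reason|phase\n"
    "1|Low accrual, sponsor decision|Phase 2\n"
    "2|Funding withdrawn|Phase 3\n"
)

def read_pipe_delimited(text):
    """Parse pipe-delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return list(reader)

rows = read_pipe_delimited(sample)
for row in rows:
    print(row["id"], "->", row["reason"])
```

The comma inside the first "reason" value stays part of the value because only the pipe is treated as a field separator.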
Because the database is so large, opening it in Excel, for example, isn't recommended. I like to work in RStudio. Because I teach a lot of courses in R, I prefer RStudio Cloud (the data and environment stay with each instance of a project), but for now let's just look at RStudio.
If you want to learn more we can set up a remote session where you can log into my cloud environment and interact live while we explore a database of interest (see below)...
I pulled a single table from the downloaded file, and here is an example of how you can quickly examine data with a few lines of code. I sampled 200,000 rows to get a fast look at the column headings and a rough estimate of what I might find. For example, look at the column "reason".
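A rough sketch of that kind of first look follows, again in Python as an assumption rather than the R used in the post. It reads only the first n rows instead of drawing a random sample, which is a simplification. The helper name `peek`, the sample rows, and every column name other than "reason" are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import islice

def peek(lines, n_rows, column, delimiter="|"):
    """Read at most n_rows data rows; return the headers and value counts for one column."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    counts = Counter(row[column] for row in islice(reader, n_rows))
    return reader.fieldnames, counts

# Hypothetical stand-in for one downloaded table. With a real file you would
# pass an open file handle instead of this in-memory text.
sample = io.StringIO(
    "id|reason|phase\n"
    "1|Low accrual|Phase 2\n"
    "2|Funding withdrawn|Phase 3\n"
    "3|Low accrual|Phase 1\n"
    "4|Sponsor decision|Phase 2\n"
)

headers, counts = peek(sample, n_rows=3, column="reason")
print(headers)
print(counts.most_common())
```

Because the rows are consumed one at a time, the same approach works on a table too large to open in a spreadsheet: raise `n_rows` to 200,000 and only that many rows are ever read.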
We have over 50 other tables to explore for granularity by condition, clinical trial phase, year of start or completion, results availability, or clinical trial design to name a few.
Here is the list of other tables in the dataset. Make sure you review the data dictionary to help you select tables of interest.
Closer examination of the tables in the database is informative, and I am happy to explore the data with you and guide you through how you might use R, Python, or Tableau Prep to clean up data for visualization. Stay tuned...