Many clients and collaborative partners are excited to visualize their data. The previous post, The voyage of data discovery, and this one aim to raise awareness of not only the tools but also the complexities of formulating a question. All of this happens upstream of our final visualizations.
I have done a lot of data sourcing work: finding open source data, including databases to query and explore. The challenge is the scale of big data. Navigating clinicaltrials.gov is a great example. Here is how I tackle looking around the tables in Python.
I gave a teaser last time about the simple code I use in R. Direct any questions to the comment box or reach out on Twitter @datamongerbonny.
Again, I am using Jupyter Notebook because I work in data literacy and teach workshops, all of which require detailed descriptions. And the biggest reason? This is how I was taught Python in an executive education program at Columbia School of Engineering.
Alternatively, you can run code right in your text editor. I do this when the work is for my eyeballs only, but it gets a little messy if you are trying to create a narrative. This image shows only a snapshot of how many individual files are in the database. I like to explore them individually or combine them in meaningful ways, depending on the question to be formulated or answered.
Explaining the details of accessing or creating your files is a bit beyond this post, but we do have workshops forming all the time. I aim to keep it general for those who might just need a nudge or can reach out with a question. If you have downloaded a CSV file, you can import it into Anaconda, where your Python is saved. This way, when you call the file, it is pulled into the console.
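If calling a file by name fails, it often helps to check which folder Python is looking in and which CSV files are actually there. Here is a minimal sketch using only the standard library. It assumes your downloaded files sit in the current working folder, which may not match your setup.

```python
from pathlib import Path

# The folder Python treats as "here"; relative file names are resolved against it.
working_dir = Path.cwd()
print(working_dir)

# List the CSV files sitting in that folder, sorted by name.
csv_files = sorted(p.name for p in working_dir.glob("*.csv"))
print(len(csv_files), "CSV files found")
for name in csv_files:
    print(name)
```

If the list comes back empty, the files are probably saved somewhere else and you will need to move them or point to their full path.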
For example, I imported the pandas module (or library) and assigned it the common alias pd. The alias is there to simplify writing code. The basic format is the alias (pd), a dot, and the function (read_csv), with the parentheses containing the arguments.
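The alias-dot-function pattern described above can be sketched as follows. The column names and values here are made up for illustration and are not from clinicaltrials.gov. The text is read from memory so the snippet runs without any download.

```python
import io

import pandas as pd  # pd is the conventional alias for pandas

# Hypothetical stand-in for a downloaded CSV file (made-up columns and rows).
csv_text = io.StringIO(
    "study_id,status,enrollment\n"
    "A001,Completed,120\n"
    "A002,Recruiting,45\n"
)

# alias (pd) . function (read_csv) ( argument )
studies = pd.read_csv(csv_text)
print(studies)
```

With a real download you would pass the file name as a string instead, for example pd.read_csv("studies.csv"), where studies.csv is a hypothetical file name.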
These tables are quite large, so during exploration I will often call head( ) to see the first 5 rows of the table. This gives you an idea of what kind of data is in each column. On tables unfamiliar to me I will often use the info( ) method. It gives a more granular overview: how many rows and columns there are, the data type of each column, and details like memory usage. This is helpful if you are limited in memory and need to make your data request more specific.
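Here is a minimal sketch of both calls on a small made-up table. The table name and columns are assumptions for illustration only. A real table would be loaded with read_csv instead.

```python
import pandas as pd

# Hypothetical 10-row table standing in for a much larger one.
trials = pd.DataFrame({
    "study_id": [f"A{i:03d}" for i in range(1, 11)],
    "status": ["Completed", "Recruiting"] * 5,
    "enrollment": [120, 45, 300, 18, 75, 60, 210, 33, 90, 150],
})

# head() returns the first 5 rows by default; pass a number for more or fewer.
first_rows = trials.head()
print(first_rows)

# info() prints the row count, column names, non-null counts,
# data types, and memory usage. It prints directly and returns nothing.
trials.info()
```

If memory is tight, read_csv also accepts usecols and nrows arguments so you can load only some columns or only the first rows.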
Having a quick list like the one above is handy when you are writing code and need to recall the column names and the tables where they are located. I will continue to work through data sourcing and tidying before we begin the steps toward visualization. Remember, it all begins with a well-articulated question; only then can we identify and curate the best available data for generating insights and answers.
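For anyone who wants to build that quick list of column names in code, here is one possible sketch. The two tables and their columns are invented for illustration. In practice each one would come from its own read_csv call.

```python
import pandas as pd

# Hypothetical tables; the names and columns are made up for this example.
tables = {
    "studies": pd.DataFrame({"study_id": ["A001"], "status": ["Completed"]}),
    "sponsors": pd.DataFrame({"study_id": ["A001"], "sponsor_name": ["Example Org"]}),
}

# Quick reference: table name -> list of its column names.
columns_by_table = {name: list(df.columns) for name, df in tables.items()}
for name, cols in columns_by_table.items():
    print(f"{name}: {', '.join(cols)}")

# Which tables contain a given column? Useful when planning how to combine them.
tables_with_study_id = [
    name for name, cols in columns_by_table.items() if "study_id" in cols
]
print(tables_with_study_id)
```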
"The art and science of asking questions is the source of all knowledge." -- Thomas Berger, American novelist