Not all data analysis tools are created equal.
Recently, I started looking into data sets to compete in Go Code Colorado (check it out if you live in CO). The problem with such diversity in data sets is finding a way to quickly visualize the data and do exploratory analysis. While tools like Tableau make data visualization extremely easy, the data isn't always properly formatted to be easily consumed. Here are a few tips to help speed up your exploratory data analysis!
We'll use data from two sources to aid with this example.
Picking the right tool
Always be able to answer the following before choosing a tool:
- How is your data stored? Is your data stored as JSON, XML, in a SQL/Relational Database, etc.?
- Are you allowed to use the data? Do you have credentials? Is it proprietary?
- Is it safe? Is it secure?
- Does it need to be updated? If so, how frequently? Is it streaming?
- What do you expect from exploratory analysis? Do you need a map, charts and/or tables?
When to use Tableau
Tableau has amazing out-of-the-box connection capabilities. It can immediately consume many file types, connect to many kinds of databases, and has pre-built connections to a number of services. Due to this flexibility, it's almost always my first choice.
Use it if you have geospatial data types! Tableau handles longitude/latitude (or creates them from addresses) and maps everything with very little effort. It easily consumed our ShapeFile (.shp): all you have to do is drag and drop, and the rest is taken care of for you.
When to avoid Tableau
I run into trouble using Tableau with nested data sources, NoSQL databases, web scraping, and anything requested over HTTPS. It simply isn't built for a lot of these things. The team has made upgrades in this department, but you will still spend far too much time getting your data into a usable format.
Tableau also isn't free (and certainly isn't cheap). Microsoft Power BI can be a great choice if you don't have Tableau, though in my opinion its functionality isn't quite as robust.
Our XML data is an example that Tableau simply cannot read; the file is greyed out in the connection dialog.
Our JSON data is nested in a way that it becomes difficult to interpret and use easily in Tableau. If you dig deep enough, there may be a way to get the results you’re looking for but it’s not a trivial task.
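If you're willing to drop into a scripting language first, one workaround is to flatten the nested JSON into a plain table before handing it to Tableau. Here's a minimal Python sketch; the sample record and the dotted column-name convention are my own illustration, not from the data sets above:

```python
# A hedged sketch: flatten nested JSON into flat columns that Tableau
# can consume as a simple tabular file. The record below is invented.
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

record = json.loads(
    '{"station": {"name": "Boulder", "loc": {"lat": 40.0}}, "rides": 12}'
)
row = flatten(record)
# row → {"station.name": "Boulder", "station.loc.lat": 40.0, "rides": 12}
```

Write rows like this out to a CSV and Tableau will pick them up without complaint.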
When to use R
Typically, my next thought is to pull the data into R. Due to the open-source nature of R and its popularity in the data science world, you can usually find a way to connect to almost any data source! If you're connecting to data online, you can use RCurl to make requests. Almost every file type and database is supported by a library (.xml = library("XML"), .json = library("jsonlite"), …). You can also find plenty of libraries developed for popular APIs.
R also has amazing data-wrangling libraries such as dplyr and tidyr (part of Hadley Wickham's tidyverse). These libraries have revolutionized how data manipulation and visualization are done. Code is easy to read, easy to write, and usually works flawlessly.
Looking at our XML file, we can easily read it from the web with RCurl and parse the data using the XML library. From there, everything is easily accessible by indexing into the parsed structure.
You will also notice that I use RStudio. It has phenomenal functionality and is the best IDE for R (nothing even comes close). You can explore your data in a table view, and even nested objects can be drilled down into so you can see the data’s structure.
R also integrates with Tableau (and Power BI for that matter)! While it’s not the most intuitive or user-friendly, it consistently gets better and is becoming more powerful because you can integrate bleeding-edge machine learning technology or customize algorithms to build better models.
When to avoid R
This is a sensitive topic in the data science world and is very subjective. However, I recommend avoiding R when you have to clean the data before storing it, for example, scraping a website for multiple days in order to build a data set for analysis. While this can be done using rvest (another phenomenal library), it tends to be more difficult and time-consuming than using Python libraries built for automation (such as scrapy). In practice, though, I rarely run into data that cannot be consumed easily in R.
Another big concern people have with R is "big data," because everything is stored in memory. For exploratory analysis, though, I hope you aren't pulling in that much data; if you are, try a subset first to get an idea of what you're dealing with.
When to use Python
This is typically my third option: I use it the least for data exploration but the most in production environments. One of Python's biggest advantages shows up when sharing your exploration with a group of people. Virtual environments are easy, fast, and dependable, and a single command, pip install -r requirements.txt, ensures that the packages you use match the author's (be careful: it doesn't ensure you're using the same version of Python). If reproducibility is important to you, Python is where it's at.
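To make the pinning idea concrete, here's a small sketch of reading exact `package==version` pins the way a requirements.txt expresses them. The helper names and sample pins are mine, not part of pip itself:

```python
# A minimal sketch, assuming a requirements.txt that uses exact
# "package==version" pins (no ranges, extras, or VCS URLs).

def parse_requirements(text):
    """Parse 'name==version' lines into a dict; hypothetical helper."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comment lines, as pip does.
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins

pins = parse_requirements("requests==2.31.0\n# analysis deps\npandas==2.0.0\n")
```

From here you could compare each pin against `importlib.metadata.version(name)` to verify your environment matches the author's, which is exactly the guarantee the version of Python itself is excluded from.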
Python is also great at handling streaming data. If you need “real-time” data, you’re going to do well with Python. CERN and NASA deal with a lot of streaming data (in huge volumes) and utilize Python!
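The pattern that makes Python comfortable here is incremental consumption: generators let you process one event at a time instead of loading the whole feed. A toy sketch, using an in-memory stand-in for a line-delimited JSON feed (the feed contents are invented):

```python
# A toy sketch of incremental stream processing with a generator.
# io.StringIO stands in for a real network stream here.
import io
import json

def iter_events(stream):
    """Yield parsed events from a line-delimited JSON stream, one at a
    time, without reading the whole feed into memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

feed = io.StringIO('{"t": 1}\n{"t": 2}\n')
events = list(iter_events(feed))
```

The same generator works unchanged whether `stream` is a file, a socket wrapper, or a test fixture, which is what makes this style pleasant for "real-time" work.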
Another great reason to use Python is if you have obscure data types. With such a large user base, chances are, someone has built a package to help you parse your data. Using a package will save you time, trouble and money.
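Even the standard library covers a surprising amount of ground before you reach for a third-party package. As a small illustration, here's parsing an XML document with `xml.etree.ElementTree`; the sample document is made up for this example:

```python
# A small illustration of letting a parser do the format handling.
# The sample XML below is invented for this sketch.
import xml.etree.ElementTree as ET

sample = """
<stations>
  <station><name>Boulder</name><rides>120</rides></station>
  <station><name>Denver</name><rides>340</rides></station>
</stations>
"""

root = ET.fromstring(sample)
# Pull each station's name and ride count into a plain dict.
rides = {s.findtext("name"): int(s.findtext("rides"))
         for s in root.iter("station")}
```

For genuinely obscure formats, the same idea applies: someone's package does the parsing, and you get back ordinary Python structures.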
When to avoid Python
Don't use Python if you need quick, out-of-the-box data visualization. Simply put, it's extremely frustrating and difficult to get a nice graph, chart, or map. While matplotlib has been extended by seaborn and a number of other tools, most lack instant gratification and require a lot of customization. Fortunately, they are catching up, and I'm sure this won't be a problem for much longer.
When to mix and match Python, R, and Tableau
At this point, you are probably able to guess when to mix and match. To be effective, start thinking through how your data is acquired, what the final output should look like and what transformations you will need to get there!
Let’s make up a random example that would be perfect for such a combination. Let’s say that you want to compare articles from the homepage of Fox News and CNN over the period of a month. How would you go about exploratory analysis? One possible workflow could be:
- Use Python to scrape the data regularly and store it in a database
- Use R to get the data into a tidy format for analysis (using tidytext) and then write the results to a csv file
- Use Tableau to visualize the results (and perhaps even utilize models from R to determine sentiment and analyze keywords)
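Step 1 of that workflow might look roughly like this in Python. The fetch step is deliberately left as a stub (a real version might use scrapy or requests plus an HTML parser), and the table name and schema are illustrative assumptions, not a spec:

```python
# A rough sketch of periodically storing scraped headlines in SQLite.
# Fetching is stubbed out; schema and names are invented for this sketch.
import sqlite3
from datetime import date

def init_db(conn):
    """Create the (hypothetical) headlines table if it doesn't exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS headlines (
               scraped_on TEXT, source TEXT, title TEXT)"""
    )

def store_headlines(conn, source, titles, scraped_on=None):
    """Insert one row per headline, stamped with the scrape date."""
    scraped_on = scraped_on or date.today().isoformat()
    conn.executemany(
        "INSERT INTO headlines VALUES (?, ?, ?)",
        [(scraped_on, source, t) for t in titles],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # a file path in the real workflow
init_db(conn)
store_headlines(conn, "example-news", ["Headline A", "Headline B"],
                scraped_on="2018-04-01")
count = conn.execute("SELECT COUNT(*) FROM headlines").fetchone()[0]
```

Run on a schedule for a month, this gives R a clean table to tidy, and Tableau a final CSV to visualize.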
For the next post, perhaps I’ll build that out and see if it’s as simple as 1, 2, 3!
Also, if anyone who works for Tableau happens to be reading this, please make a connector to SQLite databases! It’s really frustrating not having that functionality.