Building a Data Pipeline in Python – Part 2 of N – Data Exploration

Initial data acquisition and data analysis

To get an idea of what our data looks like, we need to look at it! The Jupyter Notebook embedded below shows the steps to load your data into Python and compute some basic statistics, which we can then use to identify potential issues with new data as it arrives.

This is simply the exploratory step; we will build part of the pipeline in the next step. It’s important to bring in a notebook once in a while to make sure we know what we’re looking at.

Keep in mind that this is our first look at the data, so the tests here are very basic. They will become more robust and meaningful as we continue to build out this pipeline.
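As a rough illustration of this kind of first-pass check, here is a minimal sketch that loads a column, computes a few summary statistics, and flags new values that fall far outside the baseline. It is not the code from the notebook: the sample data, the column name "amount", the helper names, and the three-standard-deviation threshold are all assumptions made up for this example, and it uses only the standard library.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a real file on disk.
BASELINE_CSV = """id,amount
1,10.0
2,12.5
3,11.0
4,9.5
5,10.5
"""


def load_column(csv_text, column):
    """Read one numeric column from CSV text into a list of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def summarize(values):
    """Compute the basic statistics we want to keep as a baseline."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(new_values, baseline, n_stdevs=3.0):
    """Return new values more than n_stdevs from the baseline mean.

    The 3-stdev default is an arbitrary starting point, not a rule.
    """
    low = baseline["mean"] - n_stdevs * baseline["stdev"]
    high = baseline["mean"] + n_stdevs * baseline["stdev"]
    return [v for v in new_values if not (low <= v <= high)]


baseline = summarize(load_column(BASELINE_CSV, "amount"))
suspicious = flag_outliers([10.2, 55.0, 9.8], baseline)
print(baseline)
print(suspicious)
```

A check this simple will miss plenty (skewed distributions, missing values, type changes), but it gives a baseline to compare new batches against while the real tests are built out.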

Exploratory Analysis – When to Choose R, Python, Tableau or a Combination

Not all data analysis tools are created equal.

Recently, I started looking into data sets to compete in Go Code Colorado (check it out if you live in CO). The problem with such a diverse collection of data sets is finding a way to quickly visualize the data and do exploratory analysis. While tools like Tableau make data visualization extremely easy, the data isn’t always formatted in a way that is easy to consume. Here are a few tips to help speed up your exploratory data analysis!

We’ll use data from two sources to aid with this example:

Picking the right tool

Always be able to answer the following before choosing a tool:

