Category Archives: Python

Building a Data Pipeline in Python – Part 2 of N – Data Exploration

Initial data acquisition and data analysis

To get an idea of what our data looks like, we need to look at it! The Jupyter Notebook embedded below shows the steps to load your data into Python and compute some basic statistics, which we can then use to identify potential issues with new data as it arrives.

This is simply the exploratory step; we will build part of the pipeline in the next step. It’s important to bring notebooks in once in a while to make sure we know what we’re looking at.

Keep in mind, this is a first look at the data, and the tests we’re running are very basic. They will become more robust and meaningful as we continue to build out this pipeline.
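To make the idea concrete, here is a minimal sketch of that kind of basic check, using only the standard library. It is not the notebook's code: the sample data, the `reading` column, the function names, and the three-sigma threshold are all assumptions chosen for illustration. It computes a baseline from data we already trust and flags new values that fall far outside it.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for data we have already explored.
HISTORICAL_CSV = """reading
10.1
9.8
10.4
10.0
9.7
"""


def load_column(csv_text, column):
    """Parse CSV text and return one column as floats, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader if row[column].strip()]


def baseline(values):
    """Summarize known data with its mean and sample standard deviation."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


def flag_outliers(new_values, stats, n_sigmas=3.0):
    """Return new values more than n_sigmas standard deviations from the mean."""
    limit = n_sigmas * stats["stdev"]
    return [v for v in new_values if abs(v - stats["mean"]) > limit]


history = load_column(HISTORICAL_CSV, "reading")
stats = baseline(history)
# Pretend these three values just arrived; only the last one is far off.
suspicious = flag_outliers([10.2, 9.9, 25.0], stats)
print(stats, suspicious)
```

A fixed three-sigma rule is a crude starting point. It assumes the historical data is roughly bell-shaped and already clean, which is exactly the kind of assumption the exploration step should test.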


ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job, but it’s an absolute necessity in the real world. If you don’t understand this process, you will have a basic grasp of it by the time you’re done with these lessons. I will be covering:

  • Data exploration
    • Understanding your data
    • Looking for red flags
    • Utilizing both statistics and data visualization
  • Checking your data for issues
    • Identifying things outside of the “normal” range
    • Deciding what to do with NaN or missing values
    • Discovering data with the wrong data type
  • How to clean and transform your data
    • Utilize the pandas library
    • Utilize pyjanitor
    • Getting data into tidy format
  • Dealing with your database
    • Determining whether or not you actually need a database
    • Choosing the right database
      • Deciding between relational and NoSQL
    • Basic schema design and normalization
    • Using an ORM – SQLAlchemy to insert data
  • Building a data pipeline
    • Separate your ETL into parts
    • Utilize luigi to keep you on track
    • Error monitoring
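As a rough preview of what "separate your ETL into parts" can look like, here is a small sketch with one function per stage. Everything in it is an assumption made for illustration: the sample CSV, the `readings` table, and the function names are hypothetical, and the standard library's `sqlite3` stands in for the SQLAlchemy and luigi setup that the series covers later.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with one missing value and one wrong data type.
RAW_CSV = """city,temperature_c
Denver,21.5
Boulder,
Golden,not_a_number
Aspen,18.0
"""


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: coerce types and separate clean rows from rejected ones."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append((row["city"], float(row["temperature_c"])))
        except ValueError:
            # Empty strings and non-numeric text both end up here.
            rejected.append(row)
    return clean, rejected


def load(records, connection):
    """Load: insert the cleaned records into a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
clean, rejected = transform(extract(RAW_CSV))
load(clean, connection)
row_count = connection.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(f"loaded {row_count} rows, rejected {len(rejected)}")
```

Keeping rejected rows rather than silently dropping them gives you something to inspect when deciding what to do with missing or mistyped values.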


New Year, New Challenge – 100 Days of Code

Starting the 100 Days of Code (#100DaysOfCode) challenge

I am always looking to boost my coding skills and as I watch everyone make resolutions for the year, I couldn’t help but think I should try this challenge. In case you don’t know what I’m referring to, one resource is https://www.100daysofcode.com/ – which really gives you a good overview of what the challenge involves.

What will I be building?

I am a project-oriented person, so I will be building a web application that runs sentiment analysis on text data from APIs.

The basic topics I hope to cover:

  • Flask Application
    • User login
    • Store data from external APIs
    • Utilize PostgreSQL and MongoDB
    • Jinja2 templating
    • Back end API development
    • Luigi ETL pipeline

I will try to send out a blog update every week or two with highlights! I will also be updating GitHub as I go along. Part of the challenge is also posting on Twitter, so each day I’ll be using the hashtag #100DaysOfCode, and you can follow me @stoltzmaniac.

Exploratory Analysis – When to Choose R, Python, Tableau or a Combination

Not all data analysis tools are created equal.

Recently, I started looking into data sets to compete in Go Code Colorado (check it out if you live in CO). The problem with such diversity in data sets is finding a way to quickly visualize the data and do exploratory analysis. While tools like Tableau make data visualization extremely easy, the data isn’t always properly formatted to be easily consumed. Here are a few tips to help speed up your exploratory data analysis!

We’ll use data from two sources to aid with this example:

Picking the right tool

Always be able to answer the following before choosing a tool:


George Washington as a Constitutional Word Cloud

Is George Washington better looking on the dollar bill or represented by a word cloud built with the text of The Constitution of the USA?

A colleague recently asked me that exact question. If you want to be taken seriously in the data science world, you’d better be able to answer something like this!

I decided that it would be fun to show off a Python package by Andreas Mueller called word_cloud by making an image from the text of the Constitution and a picture of one of the Founding Fathers.

I must warn you, word clouds are like pie charts: people like the way they look, but they don’t provide much information. That said, this package is really neat because it allows you to easily turn text into images utilizing masks, colors, and numpy!

I’ll keep this post short. What you want to do is simple:

  1. Select an image which you would like to mimic in both color and shape
  2. Read your image into Python using numpy
  3. Read your text into Python using open() and read()
  4. Make your word cloud!
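The four steps above can be sketched roughly as follows. This is not necessarily the code from the full post: the function names and file names are hypothetical, and it assumes the `wordcloud`, `numpy`, and `Pillow` packages are installed. The image is converted to RGB so the same array can serve as both the shape mask and the color source.

```python
def read_text(path):
    """Step 3: read the whole text file into a single string."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def make_masked_cloud(text, image_path, out_path):
    """Steps 1, 2 and 4: build a word cloud shaped and colored like an image."""
    # Third-party imports are kept local so read_text works without them.
    import numpy as np
    from PIL import Image
    from wordcloud import ImageColorGenerator, WordCloud

    # Step 2: load the chosen image as a numpy array.
    # Pure white pixels in the mask are left empty by WordCloud.
    image = np.array(Image.open(image_path).convert("RGB"))

    # Step 4: fit the words inside the mask, then recolor them from the image.
    cloud = WordCloud(background_color="white", mask=image, max_words=2000)
    cloud.generate(text)
    cloud.recolor(color_func=ImageColorGenerator(image))
    cloud.to_file(out_path)
    return cloud


# Example call (hypothetical file names, substitute your own):
# make_masked_cloud(read_text("constitution.txt"), "washington.png", "cloud.png")
```

An image with a plain white background works best here, since the white area is what defines the empty space around the shape.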
