ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job but it’s an absolute necessity in the real world. If you don’t understand this process, you will have a basic grasp on it by the time you’re done with these lessons. I will be covering:

  • Data exploration
    • Understanding your data
    • Looking for red flags
    • Utilizing both statistics and data visualization
  • Checking your data for issues
    • Identifying things outside of the “normal” range
    • Deciding what to do with NaN or missing values
    • Discovering data with the wrong data type
  • How to clean and transform your data
    • Utilize the pandas library
    • Utilize pyjanitor
    • Getting data into tidy format
  • Dealing with your database
    • Determining whether or not you actually need a database
    • Choosing the right database
      • Deciding between relational and NoSQL
    • Basic schema design and normalization
    • Using an ORM – SQLAlchemy to insert data
  • Building a data pipeline
    • Separate your ETL into parts
    • Utilize luigi to keep you on track
    • Error montitoring

The best way to go through this series of lessons is to read the blog post and then walk through the Jupyter Notebook located within my GitHub repository.

Here is how to think of the ETL within your data pipeline process:

E – You read in your data from a source. This could be any number of things, perhaps a CSV file, REST Api, Excel file, etc.

T – Check to ensure the data is accurate and is what you would expect it to look like. Run tests on this data.

L – Insert the data into your database (if you need one)

While ETL is relatively straightforward, there are a LOT of ways to handle it and even MORE ways to build a data pipeline. I will walk you through the most straightforward and simple way to handle it. This assumes you have “small data” that is suitable for batch processing.

Here is the problem we will be solving: sales data is coming in to us every month in a CSV. It is relatively clean but we want to make sure that there are no errors in it to ensure the business stakeholders are not making decisions based off of bad data. We need to ingest the data, clean it, and store it in a relational database. This will require an automated pipeline that is triggered when the CSV arrives and we also need to design the schema.

Lessons will come out every 1-2 weeks, get ready for some fun!