Category Archives: ETL

Building a Data Pipeline in Python – Part 5 of N – Database, ORM & SQLAlchemy

Adding data to your database

Most people working on ETL will eventually use a database. We will be examining a relational database, SQLite in this case, to store and process our data. If you are not a SQL expert, this can be a daunting task. Most relational databases require you to know keys, indices, relationships, data types, etc. While you still need an understanding of these to do things properly, you do not need to write the SQL yourself when using an object-relational mapper (ORM) such as SQLAlchemy in Python.

While the ORM handles a lot of the operations, there is one other very important thing to keep in mind: an ORM can work with many types of databases. In our case, we’re using SQLite, but if you needed to switch over to MySQL or SQL Server, you wouldn’t have to change your code! *This is mostly true; some operations are available in certain databases but not in others.*

SQLAlchemy writes all of the SQL behind the scenes for you, and this kind of abstraction can be extremely powerful for anyone who does not need extremely high-performance reads / writes.

In this example, we will look at a script that takes your data and inserts it into a database using SQLAlchemy. For simplicity, we will only use a few columns of the data to create tables: country, orders, status.
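
As a concrete illustration, here is a minimal sketch of inserting rows with SQLAlchemy’s ORM. The `Order` model, its column names, and the in-memory SQLite URL are illustrative assumptions based on the columns mentioned above, not the exact schema from the repository:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Order(Base):
    """Hypothetical table based on the columns mentioned above."""
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    country = Column(String)
    status = Column(String)

# Swap the URL for "sqlite:///pipeline.db" (or a MySQL / SQL Server URL)
# without touching the model definitions above.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # SQLAlchemy emits the CREATE TABLE for us

with Session(engine) as session:
    session.add_all([
        Order(country="Germany", status="shipped"),
        Order(country="US", status="open"),
    ])
    session.commit()
```

Notice that no SQL appears anywhere in the snippet; the engine URL is the only database-specific piece.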

If you have not gone through the previous posts, please do so in order to understand where we are in terms of functionality. For convenience, we moved all of the analysis into one folder within our repository to keep life simple.

Current thread posts on this topic:

  1. INTRODUCTION – PART 1
  2. DATA EXPLORATION – PART 2
  3. TESTING DATA – PART 3
  4. BASIC REPORTING – PART 4
  5. DATABASE, ORM & SQLALCHEMY – PART 5

Building a Data Pipeline in Python – Part 4 of N – Basic Reporting

Building a report that passes tests

At this point, we have seen what our data looks like, how it is stored, and what some basic tests might look like. In this post, we start to look at how this might be created as a report to aid in the ETL process.

Many companies find themselves in a position where a CSV (or something similar) is delivered from outside the organization. In this example, we assume it is being placed into a folder called “new_data”. Our code picks up the file and compares it against the test data we would expect in order to decide whether or not to move forward with the ETL process.
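
The pickup-and-check step might be sketched like this, assuming the expected columns are `country`, `orders`, and `status` (the notebook’s real checks go further):

```python
import io

import pandas as pd

EXPECTED_COLUMNS = {"country", "orders", "status"}  # assumed schema

def check_new_file(csv_source):
    """Decide whether a newly delivered file is safe to load further."""
    df = pd.read_csv(csv_source)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        return False, f"missing columns: {sorted(missing)}"
    return True, "all expected columns present"

# io.StringIO stands in here for a real file in the "new_data" folder
ok, msg = check_new_file(io.StringIO("country,orders,status\nUS,3,open\n"))
```

A returned `False` is the signal to hold the ETL run and notify stakeholders instead of loading bad data.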

This Jupyter notebook could be processed each time the file is updated and could be sent to stakeholders before the data is processed. It contains a very basic level of testing and visualization, but the idea should get you started. When it runs, it confirms whether or not the data fits within certain constraints and passes some integrity tests. The data is then plotted, and a final output at the bottom shows which tests have passed / failed.


Building a Data Pipeline in Python – Part 3 of N – Testing Data

Simple testing of data: columns, data types, values

In a previous post, we walked through data exploration / visualization and tests to see if our data fit basic requirements. The Jupyter Notebook, embedded below, loads the data and tests it against some rules that start to push us in a direction that allows for more customization and flexibility of our process.

We are establishing a baseline and framework. We are still in the very early process of ETL but we can start to see what the future holds. This notebook covers:

  • Verifying that all of the columns of data we need are being read in from the new file
  • Determining which columns are worth testing
  • Using basic statistics of the data to find an expected range of values
  • Testing all of the above
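
The last two steps above can be sketched as follows; the sample values and the three-standard-deviation cutoff are illustrative assumptions rather than the notebook’s exact rules:

```python
import pandas as pd

def expected_range(series, n_std=3):
    """Derive an acceptable range of values from historical data."""
    mu, sigma = series.mean(), series.std()
    return mu - n_std * sigma, mu + n_std * sigma

def values_in_range(series, low, high):
    """Test that every value falls inside the expected range."""
    return bool(series.between(low, high).all())

history = pd.Series([10, 12, 11, 9, 13, 10, 11])  # hypothetical past values
low, high = expected_range(history)
```

New data that falls outside `(low, high)` fails the test, which is exactly the kind of rule the notebook builds toward.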


Building a Data Pipeline in Python – Part 2 of N – Data Exploration

Initial data acquisition and data analysis

In order to get an idea of what our data looks like, we need to look at it! The Jupyter Notebook, embedded below, shows the steps to load your data into Python and compute some basic statistics we can use to identify potential issues with new data as it arrives.

This process is simply the exploratory step; we will build part of the pipeline in the next post. It’s important to have notebooks involved once in a while in order to make sure we know what we’re looking at.

Keep in mind, this is the first look at the data and we’re checking out some very basic testing. These tests will become more robust and meaningful as we continue to build out this pipeline.
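
As a taste of the exploration step, here is a minimal sketch; the DataFrame is invented for illustration, standing in for the loaded file:

```python
import pandas as pd

# invented sample standing in for the real data file
df = pd.DataFrame({
    "country": ["US", "DE", "FR", "US"],
    "orders": [12, 7, 9, 15],
})

# count, mean, std, min, quartiles, and max in one call
stats = df["orders"].describe()
```

Those summary statistics are the raw material for the expected-value ranges we test against later in the series.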


ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job, but it’s an absolute necessity in the real world. If you don’t understand this process yet, you will have a basic grasp on it by the time you’re done with these lessons. I will be covering:

  • Data exploration
    • Understanding your data
    • Looking for red flags
    • Utilizing both statistics and data visualization
  • Checking your data for issues
    • Identifying things outside of the “normal” range
    • Deciding what to do with NaN or missing values
    • Discovering data with the wrong data type
  • How to clean and transform your data
    • Utilize the pandas library
    • Utilize pyjanitor
    • Getting data into tidy format
  • Dealing with your database
    • Determining whether or not you actually need a database
    • Choosing the right database
      • Deciding between relational and NoSQL
    • Basic schema design and normalization
    • Using an ORM (SQLAlchemy) to insert data
  • Building a data pipeline
    • Separate your ETL into parts
    • Utilize luigi to keep you on track
    • Error monitoring
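
As a small preview of the cleaning topics above, here is a sketch of handling missing values and wrong data types with pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# invented raw data showing a wrong dtype and missing values
raw = pd.DataFrame({
    "orders": ["12", "7", None, "9"],     # numbers arrived as strings
    "status": ["open", "shipped", "open", None],
})

clean = (
    raw.assign(orders=pd.to_numeric(raw["orders"]))  # fix the data type
       .dropna(subset=["orders"])                    # drop rows missing orders
       .fillna({"status": "unknown"})                # fill missing status
)
```

Each of these decisions (drop vs. fill, which default to fill with) is something we will make deliberately, not by accident, as the series progresses.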
