Building a Data Pipeline in Python – Part 3 of N – Testing Data

Simple testing of data: columns, data types, values

In a previous post, we walked through data exploration and visualization, along with tests to see whether our data fit basic requirements. The Jupyter Notebook embedded below loads the data and tests it against a set of rules, which starts to push us toward a more customizable and flexible process.

We are establishing a baseline and a framework. We are still very early in the ETL process, but we can start to see what the future holds. This notebook covers:

  • Confirming that all of the columns we need are being read in from the new file
  • Determining which columns are worth testing
  • Using basic statistics of the data to find an expected range of values
  • Testing all of the above
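
To make the steps above concrete, here is a minimal sketch of a column check and a range check. It is not the notebook's code: the column names, the sample rows, and the choice of "mean plus or minus three standard deviations" as the expected range are all assumptions made up for illustration, and it uses only the standard library so it runs anywhere.

```python
import statistics

# Hypothetical set of columns we expect in every new file (an assumption).
REQUIRED_COLUMNS = {"id", "price", "quantity"}


def missing_columns(rows, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the first row."""
    present = set(rows[0].keys()) if rows else set()
    return sorted(required - present)


def expected_range(values, n_std=3.0):
    """Derive an expected (low, high) range as mean +/- n_std sample std devs."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - n_std * std, mean + n_std * std


def out_of_range(values, low, high):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Made-up baseline data that we already trust.
baseline = [
    {"id": 1, "price": 10.0, "quantity": 3},
    {"id": 2, "price": 11.0, "quantity": 1},
    {"id": 3, "price": 9.0, "quantity": 4},
    {"id": 4, "price": 10.5, "quantity": 2},
    {"id": 5, "price": 9.5, "quantity": 5},
]

# Made-up rows from a "new file": one column is missing, one price is suspicious.
new_rows = [
    {"id": 6, "price": 10.2},
    {"id": 7, "price": 55.0},
    {"id": 8, "price": 9.8},
]

low, high = expected_range([row["price"] for row in baseline])

print(missing_columns(new_rows))                                  # ['quantity']
print(out_of_range([row["price"] for row in new_rows], low, high))  # [55.0]
```

The same idea carries over to a DataFrame library: compare the set of column names against the required set, then compare each tested column against a range derived from data you already trust.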

Keep in mind that this notebook is intended to highlight concepts. Simply put, these methods are too basic to be used in practice: the code includes ridiculous try / except statements, excessive for loops, and so on. That simplicity should make the code easy to interpret and give a strong indication of what you should be doing to improve your own processes!

As always, you can find the code for this on my GitHub page.