ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

ETL (Extract, Transform, Load) is rarely a data scientist’s favorite part of the job, but it is an absolute necessity in the real world. If you don’t yet understand the process, you will have a basic grasp of it by the time you finish these lessons. I will be covering:

  • Data exploration
    • Understanding your data
    • Looking for red flags
    • Utilizing both statistics and data visualization
  • Checking your data for issues
    • Identifying things outside of the “normal” range
    • Deciding what to do with NaN or missing values
    • Discovering data with the wrong data type
  • Cleaning and transforming your data
    • Using the pandas library
    • Using pyjanitor
    • Getting data into tidy format
  • Dealing with your database
    • Determining whether or not you actually need a database
    • Choosing the right database
      • Deciding between relational and NoSQL
    • Basic schema design and normalization
    • Using an ORM (SQLAlchemy) to insert data
  • Building a data pipeline
    • Separating your ETL into parts
    • Using luigi to keep you on track
    • Error monitoring
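
To make the idea of separating ETL into parts concrete, here is a minimal sketch of the three stages as separate functions. It deliberately uses only the Python standard library (`csv` and `sqlite3`) rather than pandas, SQLAlchemy, or luigi. The sample data, the `people` table, and the function names are all hypothetical and chosen purely for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source file.
RAW_CSV = """name,age
Alice,34
Bob,
Carol,not available
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast age to int; blank or non-numeric values become None."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        cleaned.append((row["name"].strip(), int(age) if age.isdigit() else None))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", records)
    conn.commit()


# Run the three stages against an in-memory database.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, age FROM people ORDER BY name").fetchall()
print(rows)
```

Keeping each stage in its own function means each one can be tested and rerun on its own, which is the same property a pipeline tool relies on when it tracks which steps have completed.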
