Tag Archives: Data Science

Principal Component Analysis (PCA) – Part 4 – Python ML – OOP Basics

Goal of this post:

  1. Add principal component analysis (PCA)
  2. Refactor using inheritance
  3. Convert gradient descent to stochastic gradient descent
  4. Add new tests via pytest
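
As a rough idea of what the PCA step involves, here is a minimal sketch in plain Python. It centers the data, builds the covariance matrix, and finds only the first principal component using power iteration. Power iteration is my stand-in here, not necessarily the method used in the full post, and the function names (`center`, `covariance_matrix`, `first_component`) are hypothetical. It assumes the data has non-zero variance.

```python
import math


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix (divides by n - 1) of already-centered rows."""
    n = len(centered)
    d = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=500):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    vec = [1.0] * d  # starting guess; assumed not orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient: the variance explained along the unit vector `vec`
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    return vec, eigenvalue


rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
direction, variance = first_component(covariance_matrix(center(rows)))
```

For points that lie on the line y = x, `direction` should point along that diagonal, and `variance` is the variance of the data projected onto it.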

What we are leaving for the next post:

  1. Discussing the need for packaging
  2. Start creating an actual package

Multivariate Linear Regression – Part 3 – Refactoring – Python ML – OOP Basics

Goal of this post:

  1. Move beyond single linear regression into multiple linear regression by utilizing gradient descent
  2. Refactor using inheritance
  3. Reconfigure our pytest tests to cover the general case
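
To make the first goal concrete, here is a minimal sketch of multiple linear regression fitted with batch gradient descent, written in plain Python. The function names, the learning rate, and the epoch count are my own assumptions for illustration. The sketch also assumes small, similarly scaled features, since unscaled features usually need a smaller learning rate.

```python
def predict(weights, bias, row):
    """Linear prediction: bias + w1*x1 + w2*x2 + ..."""
    return bias + sum(w * x for w, x in zip(weights, row))


def gradient_descent(rows, targets, learning_rate=0.05, epochs=5000):
    """Minimize mean squared error with full-batch gradient descent."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        # Prediction error for every observation under the current parameters
        errors = [predict(weights, bias, row) - y for row, y in zip(rows, targets)]
        # Gradient of (1 / 2n) * sum(error^2) with respect to each parameter
        grad_w = [
            sum(e * row[j] for e, row in zip(errors, rows)) / n
            for j in range(n_features)
        ]
        grad_b = sum(errors) / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
targets = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]
weights, bias = gradient_descent(rows, targets)
```

On this noise-free toy data the fitted weights and bias should land very close to the values used to generate it, which makes it a handy case to assert against in a test.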

What we are leaving for the next post:

  1. Add principal component analysis
  2. Refactor using inheritance
  3. Add new tests via pytest

Single Linear Regression – Part 2 – Testing – Python ML – OOP Basics

We have now entered part 2 of our series on object-oriented programming in Python for machine learning. If you have not already done so, you may want to check out the previous post: Part 1.

Goal of this post:

  1. Fit a model to find coefficients
  2. Find the RMSE, R^2, slope and intercept of the model
  3. Test our model using pytest
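
The three goals above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names are hypothetical, the fit uses the closed-form least-squares formulas, and RMSE divides by n rather than by the degrees of freedom. The `test_` function follows the naming convention pytest collects, so it would run under pytest if saved in a test file.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line (divides by n)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def test_perfect_line():
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
    slope, intercept = fit_line(xs, ys)
    assert math.isclose(slope, 2.0)
    assert math.isclose(intercept, 1.0)
    assert math.isclose(rmse(xs, ys, slope, intercept), 0.0, abs_tol=1e-12)
    assert math.isclose(r_squared(xs, ys, slope, intercept), 1.0)
```

Testing against data where the answer is known in advance, such as an exact line, is a simple way to catch mistakes in the formulas before moving on to noisy data.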

What we are leaving for the next post:

  1. Refactoring and utilizing inheritance
  2. Utilizing gradient descent
  3. Updating and adding tests

Single Linear Regression – Part 1 – Python ML – OOP Basics

Data scientists who come to the career without a software background (myself included) tend to use a procedural style of programming rather than taking an object-oriented approach. Changing styles is a paradigm shift, and it takes time to wrap your mind around. Many of us who have been doing this for years still have trouble envisioning how objects can improve things. There are a lot of resources out there to help you understand this subject in more detail, but I am going to take a “learn by doing” approach. The code used for this can be found on my GitHub.

Goal of this post:

  1. Build a very basic object to house our linear regression model
  2. Create a command line interface (CLI) to pass in different datasets
  3. Print the object to the screen in a user-friendly format
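
As a rough preview of those three goals, here is a minimal sketch. It is not the code from my GitHub: the class name, the attribute names, and the assumption that the CSV has a header row with x in the first column and y in the second are all hypothetical choices made for illustration.

```python
import argparse
import csv


class SingleLinearRegression:
    """Hypothetical container for one predictor column and one response column."""

    def __init__(self, independent_var, dependent_var):
        self.independent_var = list(independent_var)
        self.dependent_var = list(dependent_var)

    def __str__(self):
        # Called by print(model) to give a user-friendly summary
        return (
            "SingleLinearRegression\n"
            f"  observations: {len(self.independent_var)}\n"
            f"  first x, y pair: {self.independent_var[0]}, {self.dependent_var[0]}"
        )


def load_two_columns(path):
    """Read a CSV with a header row, x in column 1 and y in column 2 (assumed layout)."""
    xs, ys = [], []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for row in reader:
            xs.append(float(row[0]))
            ys.append(float(row[1]))
    return xs, ys


def main(argv=None):
    """Parse the dataset path from the command line, build the object, print it."""
    parser = argparse.ArgumentParser(
        description="Load a dataset into a regression object."
    )
    parser.add_argument("dataset", help="path to a CSV file with x and y columns")
    args = parser.parse_args(argv)
    xs, ys = load_two_columns(args.dataset)
    model = SingleLinearRegression(xs, ys)
    print(model)
    return model
```

To use this as a script, you would call `main()` under an `if __name__ == "__main__":` guard and pass the dataset path on the command line. Passing a list such as `main(["my_data.csv"])` (a hypothetical file name) does the same thing from an interactive session.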

What we are leaving for the next post:

  1. Fitting a model to find coefficients
  2. Finding the RMSE, R^2, slope and intercept of the model
  3. Testing our model using pytest

Here we go!


ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job, but it’s an absolute necessity in the real world. If you don’t yet understand this process, you will have a basic grasp of it by the time you’re done with these lessons. I will be covering:

  • Data exploration
    • Understanding your data
    • Looking for red flags
    • Utilizing both statistics and data visualization
  • Checking your data for issues
    • Identifying things outside of the “normal” range
    • Deciding what to do with NaN or missing values
    • Discovering data with the wrong data type
  • How to clean and transform your data
    • Utilizing the pandas library
    • Utilizing pyjanitor
    • Getting data into tidy format
  • Dealing with your database
    • Determining whether or not you actually need a database
    • Choosing the right database
      • Deciding between relational and NoSQL
    • Basic schema design and normalization
    • Using an ORM (SQLAlchemy) to insert data
  • Building a data pipeline
    • Separating your ETL into parts
    • Utilizing luigi to keep you on track
    • Error monitoring

Current thread posts on this topic:

  1. INTRODUCTION – PART 1
  2. DATA EXPLORATION – PART 2
  3. TESTING DATA – PART 3
  4. BASIC REPORTING – PART 4
  5. DATABASE, ORM & SQLALCHEMY – PART 5
