Author Archives: Scott Stoltzman

Principal Component Analysis (PCA) – Part 4 – Python ML – OOP Basics

Goal of this post:

  1. Add principal component analysis (PCA)
  2. Refactor using inheritance
  3. Convert gradient descent to stochastic gradient descent
  4. Add new tests via pytest

What we are leaving for the next post:

  1. Discussing the need for packaging
  2. Start creating an actual package

Multivariate Linear Regression – Part 3 – Refactoring – Python ML – OOP Basics

Goal of this post:

  1. Move beyond single linear regression into multiple linear regression by utilizing gradient descent
  2. Refactor using inheritance
  3. Reconfigure our pytest to include the general case

What we are leaving for the next post:

  1. Add principal component analysis
  2. Refactor using inheritance
  3. Add new tests via pytest

Single Linear Regression – Part 2 – Testing – Python ML – OOP Basics

This is part 2 of our series on object-oriented programming in Python for machine learning. If you have not already done so, you may want to read Part 1 first.

Goal of this post:

  1. Fit a model to find coefficients
  2. Find the RMSE, R^2, slope and intercept of the model
  3. Test our model using pytest
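
The three goals above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the post: the class name `SingleLinearRegression`, its method names, and the toy data are hypothetical, and the fit uses the closed-form least-squares solution in plain Python.

```python
import math


class SingleLinearRegression:
    """Hypothetical class: least-squares fit of y = intercept + slope * x."""

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)
        self.slope = None
        self.intercept = None

    def fit(self):
        # Closed-form solution: slope = Sxy / Sxx, intercept from the means.
        n = len(self.x)
        x_mean = sum(self.x) / n
        y_mean = sum(self.y) / n
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(self.x, self.y))
        sxx = sum((xi - x_mean) ** 2 for xi in self.x)
        self.slope = sxy / sxx
        self.intercept = y_mean - self.slope * x_mean
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]

    def rmse(self):
        # Root mean squared error on the training data.
        preds = self.predict(self.x)
        squared = [(yi - pi) ** 2 for yi, pi in zip(self.y, preds)]
        return math.sqrt(sum(squared) / len(squared))

    def r_squared(self):
        # R^2 = 1 - SS_res / SS_tot.
        preds = self.predict(self.x)
        y_mean = sum(self.y) / len(self.y)
        ss_res = sum((yi - pi) ** 2 for yi, pi in zip(self.y, preds))
        ss_tot = sum((yi - y_mean) ** 2 for yi in self.y)
        return 1 - ss_res / ss_tot


def test_perfect_line():
    # pytest collects functions named test_*; the toy data lie exactly on y = 2x + 1.
    model = SingleLinearRegression([0, 1, 2, 3], [1, 3, 5, 7]).fit()
    assert math.isclose(model.slope, 2.0)
    assert math.isclose(model.intercept, 1.0)
    assert math.isclose(model.rmse(), 0.0, abs_tol=1e-12)
    assert math.isclose(model.r_squared(), 1.0)
```

Saved in a file whose name starts with `test_`, the last function would be picked up by running `pytest` in that directory.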

What we are leaving for the next post:

  1. Refactoring and utilizing inheritance
  2. Utilizing gradient descent
  3. Updating and adding tests

Single Linear Regression – Part 1 – Python ML – OOP Basics

Data scientists who come to the career without a software background (myself included) tend to program in a procedural style rather than an object-oriented one. Changing styles is a paradigm shift that takes time to wrap your mind around, and many of us who have been doing this for years still have trouble envisioning how objects can improve things. There are plenty of resources that cover the subject in more detail, but I am going to take a “learn by doing” approach. The code used for this can be found on my GitHub.

Goal of this post:

  1. Build a very basic object to house our linear regression model
  2. Create a command line interface (CLI) to pass in different datasets
  3. Print the object to the screen in a user-friendly format
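
As a rough sketch of what those three goals might look like together (not the code from the repository): the class name `LinearRegressionModel`, the argument names, and the example file name are all hypothetical. The object only stores its settings, `argparse` supplies the command line interface, and `__str__` controls what `print` shows.

```python
import argparse


class LinearRegressionModel:
    """Hypothetical container for a regression model; no fitting yet."""

    def __init__(self, data_path, response, predictor):
        self.data_path = data_path
        self.response = response
        self.predictor = predictor

    def __str__(self):
        # Called by print(); returns a user-friendly summary.
        return (
            "LinearRegressionModel\n"
            f"  data:      {self.data_path}\n"
            f"  response:  {self.response}\n"
            f"  predictor: {self.predictor}"
        )


def parse_args(argv=None):
    # argv=None makes argparse read sys.argv, as in a normal script run.
    parser = argparse.ArgumentParser(description="Describe a regression model.")
    parser.add_argument("data_path", help="path to a CSV file")
    parser.add_argument("--response", default="y", help="response column name")
    parser.add_argument("--predictor", default="x", help="predictor column name")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    model = LinearRegressionModel(args.data_path, args.response, args.predictor)
    print(model)
    return model
```

In a script, calling `main()` under the usual `if __name__ == "__main__":` guard would let you pass a different dataset on each run, for example `python regression.py housing.csv --response price --predictor sqft` (script and file names are made up).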

What we are leaving for the next post:

  1. Fitting a model to find coefficients
  2. Finding the RMSE, R^2, slope and intercept of the model
  3. Testing our model using pytest

Here we go!

Building a Data Pipeline in Python – Part 5 of N – Database, ORM & SQLAlchemy

Adding data to your database

Many people focusing on ETL will eventually use a database. We will use a relational database, SQLite in this case, to store and process our data. If you are not a SQL expert, this can be daunting. Most relational databases require you to understand keys, indices, relationships, data types, etc. While you still need to understand these to do things properly, you do not need to write the SQL yourself when using an object-relational mapper (ORM) such as SQLAlchemy in Python.

The ORM handles many of the operations for you, and it has one other important advantage: these tools can work with many types of databases. In our case, we’re using SQLite, but if you needed to switch to MySQL or SQL Server, you wouldn’t have to change your code! *This is mostly true; some operations are available in certain databases but not in others.*

SQLAlchemy writes the SQL behind the scenes for you, and this kind of abstraction can be very powerful for those who do not need extremely high-performance reads and writes.

In this example, we will walk through a script that takes your data and inserts it into a database using SQLAlchemy. For simplicity, we will use only a few columns of data to create three tables: country, orders, and status.
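
To give a feel for the approach, here is a minimal sketch of three such tables declared through SQLAlchemy's ORM. It is not the script from the post: the column names, the relationships, and the sample row are assumptions made for illustration, and it uses an in-memory SQLite database so nothing is written to disk. It assumes SQLAlchemy 1.4 or newer.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Country(Base):
    __tablename__ = "country"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)  # assumed column


class Status(Base):
    __tablename__ = "status"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)  # assumed column


class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    order_number = Column(Integer, nullable=False)  # assumed column
    country_id = Column(Integer, ForeignKey("country.id"), nullable=False)
    status_id = Column(Integer, ForeignKey("status.id"), nullable=False)
    country = relationship("Country")
    status = relationship("Status")


# In-memory SQLite; only this URL would change for another database backend.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # emits the CREATE TABLE statements

with Session(engine) as session:
    # Made-up sample row; the related Country and Status are saved with it.
    order = Order(
        order_number=10107,
        country=Country(name="USA"),
        status=Status(name="Shipped"),
    )
    session.add(order)
    session.commit()
    rows = [
        (o.order_number, o.country.name, o.status.name)
        for o in session.query(Order).all()
    ]

print(rows)
```

No SQL appears anywhere in the sketch: the classes describe the keys and data types, and SQLAlchemy generates the statements for whichever database the engine URL points at.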

If you have not gone through the previous posts, please do so in order to understand where we are in terms of functionality. For convenience, we moved all of the analysis into one folder within our repository to keep life simple.

Current thread posts on this topic:

  1. INTRODUCTION – PART 1
  2. DATA EXPLORATION – PART 2
  3. TESTING DATA – PART 3
  4. BASIC REPORTING – PART 4
  5. DATABASE, ORM & SQLALCHEMY – PART 5