Single Linear Regression – Part 2 – Testing – Python ML – OOP Basics

We have now entered part 2 of our series on object-oriented programming in Python for machine learning. If you have not already done so, you may want to check out the previous post –> Part 1.

Goal of this post:

  1. Fit a model to find coefficients
  2. Find the RMSE, R^2, slope and intercept of the model
  3. Test our model using pytest

What we are leaving for the next post:

  1. Refactoring and utilizing inheritance
  2. Utilizing gradient descent
  3. Updating and adding tests

Here we go!

This post will go over some of the basics of testing your OOP model. We will be using the pytest package to do our testing. The package is documented with great examples here. Testing is key to being able to write and modify your model's code with confidence.
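If you have not used pytest before, here is a minimal sketch of how it works: pytest collects files named `test_*.py`, runs every function whose name starts with `test_`, and treats a plain `assert` as the pass/fail mechanism (the file name below is hypothetical):

```python
# test_example.py (hypothetical file) -- pytest collects test_*.py files
# and runs every function whose name starts with test_.
def test_addition():
    # A bare assert is all pytest needs; a failing assert fails the test.
    assert 1 + 1 == 2
```

Running `pytest` from the project root will discover and execute this test automatically.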

Moving on, it’s worth noting that we have made some minor changes to the directory structure of this project:

We have also added to the requirements.txt because we installed both pytest and pandas.
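The requirements.txt now contains entries along these lines (exact pinned versions omitted, since they will vary):

```
numpy
pandas
pytest
```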

Let’s take a look at our new regression.py file:

  • Added numpy – a popular library for manipulating vectors and matrices
  • Removed predict from the current class (for now)
  • Added b1, b0, predicted_values, and fit(), which allow us to fit a model as soon as we create an instance
  • Defined properties for the means of the input data
  • Created the fit() method to solve for the coefficients
  • Created the predict() method to utilize the model
  • Created the root_mean_squared_error() and r_squared() methods to assess the fit
  • Changed the __str__() method to report the values in a straightforward way
import numpy as np

class SingleLinearRegression:

    def __init__(self, independent_var: np.ndarray, dependent_var: np.ndarray):
        """
        Complete a single linear regression.
        :param independent_var: np.ndarray of predictor values
        :param dependent_var: np.ndarray of response values
        """
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.b1 = None
        self.b0 = None
        self.predicted_values = None
        self.fit()

    @property
    def independent_var_mean(self):
        return np.mean(self.independent_var)

    @property
    def dependent_var_mean(self):
        return np.mean(self.dependent_var)

    def fit(self):
        # Format: dependent_var_hat = b1*independent_var + b0
        x_minus_mean = [x - self.independent_var_mean for x in self.independent_var]
        y_minus_mean = [y - self.dependent_var_mean for y in self.dependent_var]
        b1_numerator = sum([x * y for x, y in zip(x_minus_mean, y_minus_mean)])
        b1_denominator = sum([x ** 2 for x in x_minus_mean])
        self.b1 = b1_numerator / b1_denominator
        self.b0 = self.dependent_var_mean - (self.b1 * self.independent_var_mean)

    def predict(self, values_to_predict: np.ndarray):
        predicted_values = values_to_predict * self.b1 + self.b0
        return predicted_values

    def root_mean_squared_error(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_res = np.sum((dependent_var_hat - self.dependent_var) ** 2)
        rmse = np.sqrt(sum_of_res / len(dependent_var_hat))
        return rmse

    def r_squared(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_sq = np.sum((self.dependent_var - self.dependent_var_mean) ** 2)
        sum_of_res = np.sum((self.dependent_var - dependent_var_hat) ** 2)
        return 1 - (sum_of_res / sum_of_sq)

    def __str__(self):
        return f"""
            Model Results
            -------------
            b1: {round(self.b1, 2)}
            b0: {round(self.b0, 2)}
            RMSE: {round(self.root_mean_squared_error(), 2)}
            R^2: {round(self.r_squared(), 2)}
            """

You will notice that our math is relatively straightforward. The usage of list comprehensions allows us to easily complete our calculations utilizing the order of operations. You may also notice that the class immediately fits a model to the data when it’s instantiated. This is not a common practice; however, it works well for our use case.
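To see the closed-form math from fit() in action, here is a quick sketch on a tiny made-up dataset where y = 2x + 1 exactly, so the coefficients, RMSE, and R^2 are known in advance:

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1, so we know the answers.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# The same closed-form least-squares formulas fit() uses:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# With a perfect linear relationship, RMSE is 0 and R^2 is 1.
y_hat = b1 * x + b0
rmse = np.sqrt(np.sum((y_hat - y) ** 2) / len(y))
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(b1, b0, rmse, r_squared)  # → 2.0 1.0 0.0 1.0
```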

Now that we have our model in place, there is plenty of testing to do. Note that this is not the test driven development (TDD) way of doing things, where the tests are written before the code. I think TDD is a great way to work, but it is quite rigid and not conducive to blog posts. Let’s expand our directory structure to see how we will test our model. Here’s a quick breakdown:

  • tests -> a folder that pytest discovers automatically; it houses everything related to our testing
    • my_test_data -> holds a csv with sample data for which we know the expected results
    • regression -> holds tests related to the regression.py file
    • conftest.py -> a setup file for pytest that allows us to easily pass “global” style variables and setup parameters for use throughout the tests
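Putting the pieces above together, the project layout now looks roughly like this (a sketch based on the names mentioned in this post; your exact structure may differ):

```
project/
├── regression.py
├── requirements.txt
└── tests/
    ├── conftest.py
    ├── my_test_data/
    │   └── my_test_data.csv
    └── regression/
        └── test_single_linear_regression.py
```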

In our case, when we run pytest the first thing it will run is conftest.py. This is a “configuration” that allows us to pull our data one time, rather than having to read the csv for each test function. While it is not a big deal in our case, you could imagine having to do this for a very large test suite that may require a lot of database queries. Here is what’s going on:

  • pytest.fixture(scope='session') -> registers single_linear_regression_data as a “global” style fixture that lives for the whole session. It can now be used and passed to all functions in the test suite. In our case, it yields the csv data as a dictionary
import pytest
import pandas as pd
import numpy as np


@pytest.fixture(scope='session')
def single_linear_regression_data() -> dict:
    """
    Setup test data for
    :return:
    """
    df = pd.read_csv('my_test_data/my_test_data.csv')
    yield {
        'dependent_var': np.array(df['dependent_var']),
        'independent_var': np.array(df['independent_var'])
    }
    # Code after the yield runs as teardown once the session ends.
    print('single_linear_regression_data fixture finished.')

Next, we will look at the only file we have written that contains tests – test_single_linear_regression.py. Because we know what each method should return, we will ensure those results are accurate. There are many more tests that could be run against these modules (e.g. checking what happens when data of different types are passed in, when the data contains errors, or when it holds null values). For demonstration purposes, we will simply test cases that we know to be accurate, but feel free to add on to this. Here’s what’s going on:

  • pytest.fixture(scope='module') -> defines reg_model, a fixture that creates an instance of our SingleLinearRegression model utilizing the single_linear_regression_data fixture defined in conftest.py
  • test_single_linear_regression_data_passing_correctly -> checks to see that data within the model is of the right type and that all input data matches what is stored in the model
  • test_single_linear_regression_fit -> checks that the model that was fit has the same coefficients as expected (to a certain degree of accuracy)
  • test_single_linear_regression_rmse and test_single_linear_regression_r_squared -> check that the calculated values match (to a certain degree of accuracy)
import numpy as np
import pytest

from regression import SingleLinearRegression


@pytest.fixture(scope='module')
def reg_model(single_linear_regression_data):
    linear_regression_model = SingleLinearRegression(independent_var=single_linear_regression_data['independent_var'],
                                                     dependent_var=single_linear_regression_data['dependent_var'])
    return linear_regression_model


def test_single_linear_regression_data_passing_correctly(reg_model, single_linear_regression_data):
    """
    Setup linear regression model
    :return:
    """
    assert(reg_model.independent_var.all() == single_linear_regression_data['independent_var'].all())
    assert(reg_model.dependent_var.all() == single_linear_regression_data['dependent_var'].all())
    assert(type(reg_model.independent_var) == np.ndarray)
    assert(type(reg_model.dependent_var) == np.ndarray)


def test_single_linear_regression_fit(reg_model):
    """
    Test regression model coefficients
    :return:
    """
    assert(pytest.approx(reg_model.b1, 0.01) == 1.14)
    assert(pytest.approx(reg_model.b0, 0.01) == 0.43)


def test_single_linear_regression_rmse(reg_model):
    """
    Test regression model root mean squared error
    :return:
    """
    assert(pytest.approx(reg_model.root_mean_squared_error(), 0.02) == 0.31)


def test_single_linear_regression_r_squared(reg_model):
    """
    Test regression model r_squared
    :return:
    """
    assert(pytest.approx(reg_model.r_squared(), 0.01) == 0.52)
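The second argument to pytest.approx above is a relative tolerance: the comparison passes when the two numbers differ by less than roughly that fraction of the value. The standard library's math.isclose expresses the same idea, as a quick sketch:

```python
import math

# pytest.approx(value, rel) passes when the compared numbers differ by
# less than roughly rel * |value|; math.isclose works the same way.
assert math.isclose(1.145, 1.14, rel_tol=0.01)     # within 1%, passes
assert not math.isclose(1.20, 1.14, rel_tol=0.01)  # ~5% apart, fails
```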

These tests can be run in a multitude of ways. My favorite is to utilize an IDE’s built-in functionality, which allows you to run a single test, a handful of tests, or all of the tests. In general, PyCharm is my favorite tool for the job. After running, you should see all tests passing, and the output will look something like this (results will vary depending on the options being passed to pytest):

There we have it, all tests passed in 0.03 seconds! We will continue writing tests as we move along to ensure functionality. This will be especially important when we refactor in the next post!