Multivariate Linear Regression – Part 3 – Refactoring – Python ML – OOP Basics

Goal of this post:

  1. Move beyond single linear regression into multiple linear regression by utilizing gradient descent
  2. Refactor using inheritance
  3. Reconfigure our pytest to include the general case

What we are leaving for the next post:

  1. Add principal component analysis
  2. Refactor using inheritance
  3. Add new tests via pytest

In the previous post, we were able to easily solve for coefficients / betas / weights of our model. Due to the fact this could be solved via a straightforward analytical solution, it did not make sense to utilize more complicated techniques to solve for weights. However, as we enter the multidimensional world, things are slightly more complicated. We will use numpy and gradient descent to solve for the weights.

This process requires a lot of code. I will only go over the more complicated sections and leave the easier parts for you to dig through on GitHub.

Here we go!

First, we know that regression does not just apply to “linear regression” and certainly goes beyond “single linear regression.” There are a number of other algorithms that complete regression tasks: logistic regression, random forest, neural networks, the list goes on… So we want to create a base class that will contain all of the important data points and methods that will apply across all of these machine learning techniques. What do we mean by this?

All regression techniques must have the following input:

  1. Dependent / target / class / “y” variable
  2. Independent / predictor / “x” variable(s)
  3. Split for test / train data
  4. Random seed
  5. Max number of iterations
  6. Learning rate

That means, we can utilize this as a base class. Every method beyond this can inherit those characteristics in order to allow us to keep DRY code and take advantage of OOP.

Here is our base class Regression:

import numpy as np


class Regression:
    def __init__(
        self,
        independent_vars: np.ndarray,
        dependent_var: np.ndarray,
        iterations: int,
        learning_rate: float,
        train_split: float,
        seed: int,
    ):
        """
        :param independent_vars: np.ndarray
        :param dependent_var: np.array (one dimensional)
        :param iterations: int
        :param learning_rate: float
        :param train_split: float (0 < value < 1)
        :param seed: int
        """
        # Check data types
        if type(seed) != int:
            raise ValueError(f"seed value not an int")
        if type(iterations) != int or iterations <= 0:
            raise ValueError(f"Invalid iterations value")
        type_check_arrays = [
            type(independent_vars) == np.ndarray,
            type(dependent_var) == np.ndarray,
        ]
        if not all(type_check_arrays):
            raise ValueError(f"Type(s) of data for Regression class are not accurate")
        if (
            not type(learning_rate) == float
            and type(train_split) == float
            and train_split <= 1
        ):
            raise ValueError(f"learning_rate or train_split not acceptable input(s)")
        if dependent_var.shape[0] != independent_vars.shape[0]:
            raise ValueError(
                f"Dimension(s) of data for Regression class are not accurate"
            )

        all_data = np.column_stack((independent_vars, dependent_var))
        all_data = all_data.astype("float")
        np.random.seed(seed)
        np.random.shuffle(all_data)

        split_row = round(all_data.shape[0] * train_split)
        train_data = all_data[:split_row]
        test_data = all_data[split_row:]

        # Train
        self.independent_vars_train = train_data[:, :-1]
        self.dependent_var_train = train_data[:, -1:]

        # Test
        self.independent_vars_test = test_data[:, :-1]
        self.dependent_var_test = test_data[:, -1:]

        self.learning_rate = learning_rate
        self.iterations = iterations
        self.cost = []

You’ll notice that the first chunk of code defines what the class will need for input. The next bit simply checks that the input is acceptable. Once we get into the numpy data manipulations is where things get interesting. Due to the fact we are splitting the data, we combine the columns so the dependent and independent variables are lined up for test/train split and randomization. There are many ways to do this, but this is very clear and can be easily understood. Next, we initialize some self variables and add self.cost because it is nice to be able to have an idea of the cost reduction.

Our `Regression` class is complete and has everything we need to move forward with any type of regression. Since we are looking at linear regression, we will start by making a class for the general case. Previously, we created a SingleLinearRegression class. However, that methodology does not work for multiple variables. So rather than having a MultipleLinearRegression class as well (which could be done). A simple LinearRegression class handles both cases. In theory, you could add a switch statement to utilize the analytical solution when there’s only one variable, but we are going to use gradient descent to solve both problems. Feel free to modify the formula if that is important to you.

class LinearRegression(Regression):
    def __init__(
        self,
        independent_vars,
        dependent_var,
        iterations,
        learning_rate,
        train_split,
        seed,
    ):
        """
        All inherited from Regression class
        """
        super().__init__(
            independent_vars,
            dependent_var,
            iterations,
            learning_rate,
            train_split,
            seed,
        )

        # Add ones column to allow for beta 0
        self.independent_vars_train = np.c_[
            np.ones(len(self.independent_vars_train), dtype="int64"),
            self.independent_vars_train,
        ]
        self.independent_vars_test = np.c_[
            np.ones(len(self.independent_vars_test), dtype="int64"),
            self.independent_vars_test,
        ]

        # Initialize betas
        if len(self.independent_vars_train.shape) == 1:
            self.B = np.zeros(1)
        else:
            self.B = np.zeros(self.independent_vars_train.shape[1])

        # Automatically fit
        self.fit_gradient_descent()

    def predict(self, values_to_predict: np.ndarray):
        predicted_values = values_to_predict.dot(self.B).flatten()
        return predicted_values

    def find_gradient(self):
        estimate = self.predict(self.independent_vars_train)
        error = self.dependent_var_train.flatten() - estimate
        gradient = -(1.0 / len(self.independent_vars_train)) * error.dot(
            self.independent_vars_train
        )
        self.cost.append(np.power(error, 2))
        return gradient

    def fit_gradient_descent(self):
        for i in range(self.iterations):
            gradient = self.find_gradient()
            self.B = self.B - (self.learning_rate * gradient)

    def calculate_r_squared(self, independent_vars, dependent_var):
        sum_sq_r = np.sum((dependent_var - self.predict(independent_vars)) ** 2)
        sum_sq_t = np.sum((dependent_var - dependent_var.mean()) ** 2)
        return 1 - (sum_sq_r / sum_sq_t)

    def __str__(self):
        return f"""
            Model Results
            -------------
            Betas: {[i for i in zip(range(len(self.B)), self.B)]}
            R^2 Train: {self.calculate_r_squared(self.independent_vars_train, self.dependent_var_train[:, 0])}
            R^2 Test: {self.calculate_r_squared(self.independent_vars_test, self.dependent_var_test[:, 0])}
            """

You will notice now that we have inherited the base class (Regression) by simply defining our class as LinearRegression(Regression). In doing so, we do not need to redefine the input parameters. Another thing to note is that our LinearRegression class does not require any additional parameters to be passed. This is not always the case. If you needed to increase the number of inputs, the __init__ method allows for more variables to be passed.

After instantiation, we see that our data needs a bit of fine tuning which would not be the case for other models (which is why it was not included in the base class). Our independent variables need an extra np.ones column to be tacked on in order to calculate the intercept / beta 0 term. We apply this to both the test and train variables.

We also initialize the weights / betas / coefficients right after that in order to ensure we have the additional weight (vectors and matrices are sensitive to dimensions). If you are not familiar with numpy and linear algebra, you should make that a priority because almost all machine learning depends on it. Our predict method is self-explanatory. Moving on to the find_gradient method, we see that after a gradient is calculated, we are reducing its impact by the learning_rate (also referred to as alpha). We can then change the weights in our fit_gradient_descent method. We continue to do so for the amount of iterations given. Each iteration reduces the cost and brings our model closer to the reaching the global minimum. In many other algorithms we are only concerned with local minima, but here we will always tend to converge at the global minimum.

Rather than iterating a set number of times, a better solution tends to be setting a cost threshold at which you are willing to accept the error. This is not hard to implement but requires some exit conditions in case your model is not converging. We may follow up with this approach in the future.

The next step we care about, is R^2. In a general sense, this is how well your model “fits” the data. The calculation is laid out by the calculate_r_squared method and is self-explanatory.

Our __str__ method is understood by Python and is what comes out when you print your instance. It is intended to be for the end user to consume rather than for developer’s use. In our case, we print the betas / weights / coefficients in the order they line up with the independent variable input (starting with our beta 0 term for an intercept). We also see test and train R^2 values in order to help us see how the value compares across data sets and increase our understanding of expectations for future performance.

Let’s take a look at how our tests are structured! Within our tests directory, we now see another set of test data and another regression test (for multiple). Our conftest.py sets up our data just like before.

Taking a look at our test_multiple_linear_regression.py file:

import numpy as np
import pytest

from regression import LinearRegression


@pytest.fixture(scope="module")
def multiple_linear_regression_model(multiple_linear_regression_data):
    linear_regression_model = LinearRegression(
        independent_vars=multiple_linear_regression_data["independent_vars"],
        dependent_var=multiple_linear_regression_data["dependent_var"],
        iterations=10000,
        learning_rate=0.001,
        train_split=0.7,
        seed=123,
    )
    return linear_regression_model


def test_multiple_linear_regression_data_passing_correctly(
    multiple_linear_regression_model, multiple_linear_regression_data
):
    """
    Setup linear regression model
    :return:
    """
    assert (
        multiple_linear_regression_model.independent_vars_train.all()
        == multiple_linear_regression_data["independent_vars"].all()
    )
    assert (
        multiple_linear_regression_model.dependent_var_train.all()
        == multiple_linear_regression_data["dependent_var"].all()
    )
    assert type(multiple_linear_regression_model.independent_vars_train) == np.ndarray
    assert type(multiple_linear_regression_model.dependent_var_train) == np.ndarray


def test_multiple_linear_regression_coefficients(multiple_linear_regression_model):
    """
    Test regression model coefficients
    :return:
    """
    print(multiple_linear_regression_model)
    expected_coefficients = [
        (0, 0.31701131823834194),
        (1, 0.019291066504884043),
        (2, 0.03215158052302674),
        (3, 0.7237572134362669),
    ]
    no_of_betas = len(multiple_linear_regression_model.B)
    for n in range(no_of_betas):
        assert (
            pytest.approx(expected_coefficients[n][1], 0.001)
            == multiple_linear_regression_model.B[n]
        )


def test_multiple_linear_regression_r_squared(multiple_linear_regression_model):
    """
    Test regression model r_squared
    :return:
    """
    # Train Data
    train_r_squared = multiple_linear_regression_model.calculate_r_squared(
        multiple_linear_regression_model.independent_vars_train,
        multiple_linear_regression_model.dependent_var_train[:, 0],
    )

    test_r_squared = multiple_linear_regression_model.calculate_r_squared(
        multiple_linear_regression_model.independent_vars_test,
        multiple_linear_regression_model.dependent_var_test[:, 0],
    )

    assert pytest.approx(train_r_squared, 0.001) == 0.5179372162803452
    assert pytest.approx(test_r_squared, 0.001) == 0.2796260458401886

We have solved our regression problems and know what the results should be. However, due to the fact that float data types are very precise, we utilize pytest.approx to round to a specific set of significant digits. The same setup is written for single linear regression as well.

When running our tests, we can see that everything passes!

Whew! That was a long post, hope you found it useful if you made it this far. You should be able to think of a lot more test cases to run and will likely have to adjust your class to handle edge cases.