We have now entered Part 2 of our series on object-oriented programming in Python for machine learning. If you have not already done so, you may want to check out the previous post –> Part 1.

**Goal of this post:**

- Fit a model to find coefficients
- Find the RMSE, R^2, slope and intercept of the model
- Test our model using `pytest`

**What we are leaving for the next post:**

- Refactoring and utilizing inheritance
- Utilizing gradient descent
- Updating and adding tests

Here we go!

This post will go over some of the basics of testing your OOP model. We will be using the `pytest` package to do our testing. The package is documented with great examples here. Testing is a key part of feeling comfortable with writing code and modifying your model.
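As a quick refresher on how `pytest` works before we apply it to our model: it automatically discovers functions whose names start with `test_`, and a plain `assert` statement is all it takes to define a check. A minimal sketch, using a hypothetical `add` function:

```python
# Minimal pytest example: save as test_add.py and run `pytest`.
# Any function named test_* is collected automatically; a bare
# assert statement is the whole testing API you need to start.
def add(a, b):
    return a + b


def test_add():
    assert add(2, 3) == 5
```

Running `pytest` against a file like this reports the test as passed or failed with no extra boilerplate.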

Moving on, it’s worth noting that we have made some minor changes to the directory structure of this project:

We have also added to the `requirements.txt` because we installed both `pytest` and `pandas`.

Let’s take a look at our new `regression.py` file:

- Added `numpy`, a popular library to help deal with manipulating vectors and matrices
- Removed `predict` from the current class (for now)
- Added `b1`, `b0`, `predicted_values`, and `fit()`, which allow us to fit a model once we create an instance
- Defined properties for the means of the input data
- Created the `fit()` method to solve for the coefficients
- Created the `predict()` method to utilize the model
- Created the `root_mean_squared_error()` and `r_squared()` methods to assess the fit
- Changed the `__str__()` method to reflect the values in a straightforward way

```python
import numpy as np


class SingleLinearRegression:
    def __init__(self, independent_var: np.ndarray, dependent_var: np.ndarray):
        """
        Complete a single linear regression.
        :param independent_var: np.ndarray
        :param dependent_var: np.ndarray
        """
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.b1 = None
        self.b0 = None
        self.predicted_values = None
        self.fit()

    @property
    def independent_var_mean(self):
        return np.mean(self.independent_var)

    @property
    def dependent_var_mean(self):
        return np.mean(self.dependent_var)

    def fit(self):
        # Format: dependent_var_hat = b1 * independent_var + b0
        x_minus_mean = [x - self.independent_var_mean for x in self.independent_var]
        y_minus_mean = [y - self.dependent_var_mean for y in self.dependent_var]
        b1_numerator = sum([x * y for x, y in zip(x_minus_mean, y_minus_mean)])
        b1_denominator = sum([(x - self.independent_var_mean) ** 2 for x in self.independent_var])
        self.b1 = b1_numerator / b1_denominator
        self.b0 = self.dependent_var_mean - (self.b1 * self.independent_var_mean)

    def predict(self, values_to_predict: np.ndarray):
        predicted_values = values_to_predict * self.b1 + self.b0
        return predicted_values

    def root_mean_squared_error(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_res = np.sum((dependent_var_hat - self.dependent_var) ** 2)
        rmse = np.sqrt(sum_of_res / len(dependent_var_hat))
        return rmse

    def r_squared(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_sq = np.sum((self.dependent_var - self.dependent_var_mean) ** 2)
        sum_of_res = np.sum((self.dependent_var - dependent_var_hat) ** 2)
        return 1 - (sum_of_res / sum_of_sq)

    def __str__(self):
        return f"""
        Model Results
        -------------
        b1: {round(self.b1, 2)}
        b0: {round(self.b0, 2)}
        RMSE: {round(self.root_mean_squared_error(), 2)}
        R^2: {round(self.r_squared(), 2)}
        """
```

You will notice that our math is relatively straightforward. The use of list comprehensions allows us to complete our calculations concisely while respecting the order of operations. You may also notice that the class immediately fits a model to the data when it’s instantiated. This is not a common practice; however, it works well for our use case.
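To see the closed-form least-squares formulas from `fit()` at work outside the class, here is a standalone sketch using made-up sample points that fall exactly on the line y = 2x + 1, so we know what the coefficients should come out to:

```python
import numpy as np

# Made-up sample points lying exactly on y = 2x + 1, so the
# least-squares formulas should recover b1 = 2 and b0 = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Same math as fit():
# b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # → 2.0 1.0
```

When the data has no noise, the slope and intercept are recovered exactly, which makes this a handy sanity check before fitting real data.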

Now that we have our model in place, there is plenty of testing to do. Writing the model first and the tests second is not the test-driven development (TDD) way of doing things. I think TDD is a great way to work, but it is very rigid and not conducive to blog posts. Let’s expand our directory structure to see how we will test our model. Here’s a quick breakdown:

- `tests` -> folder that is easily discovered by `pytest` and houses everything related to our testing
- `my_test_data` -> holds a `csv` with sample data for which we know what the results should be
- `regression` -> holds tests related to the `regression.py` file
- `conftest.py` -> a setup file for `pytest` that allows us to easily pass “global” style variables and setup parameters for use throughout the tests

In our case, when we run `pytest`, the first thing it will run is `conftest.py`. This is a “configuration” that allows us to pull our data one time, rather than having to read the `csv` for each test function. While it is not a big deal in our case, you could imagine having to do this for a very large test suite that may require a lot of database queries. Here is what’s going on:

- `pytest.fixture(scope='session')` -> sets the variable `single_linear_regression_data` as a “global” variable that can be called for the session. This variable can now be used and passed to all functions in the test suite. In our case, it will contain the `csv` data, returned in a dictionary.

```python
import pytest
import pandas as pd
import numpy as np


@pytest.fixture(scope='session')
def single_linear_regression_data() -> dict:
    """
    Setup test data for the regression tests.
    :return:
    """
    df = pd.read_csv('my_test_data/my_test_data.csv')
    yield {
        'dependent_var': np.array(df['dependent_var']),
        'independent_var': np.array(df['independent_var'])
    }
    # Runs at session teardown, after the yield.
    print('single_linear_regression_data fixture finished.')
```

Next, we will look at the only file we have written that contains tests – `test_single_linear_regression.py`. Because we know what each method should return, we will ensure those results are accurate. There are **a lot** more tests that should be run to check these modules (i.e. checking what happens when data of different types are passed in, expecting errors in data, handling null values, etc.). For demonstration purposes, we will simply test cases that we know to be accurate, but feel free to add on to this. Here’s what’s going on:

- `pytest.fixture(scope='module')` -> creates an instance of our `SingleLinearRegression` model utilizing the `single_linear_regression_data` fixture defined in `conftest.py`
- `test_single_linear_regression_data_passing_correctly` -> checks that the data within the model is of the right type and that all input data matches what is stored in the model
- `test_single_linear_regression_fit` -> checks that the fitted model has the expected coefficients (to a certain degree of accuracy)
- `test_single_linear_regression_rmse` and `test_single_linear_regression_r_squared` -> check that the calculated values match (to a certain degree of accuracy)

```python
import numpy as np
import pytest

from regression import SingleLinearRegression


@pytest.fixture(scope='module')
def reg_model(single_linear_regression_data):
    linear_regression_model = SingleLinearRegression(
        independent_var=single_linear_regression_data['independent_var'],
        dependent_var=single_linear_regression_data['dependent_var']
    )
    return linear_regression_model


def test_single_linear_regression_data_passing_correctly(reg_model, single_linear_regression_data):
    """
    Test that the input data is stored in the model unchanged.
    :return:
    """
    # Compare element-wise, then check that every element matched.
    assert (reg_model.independent_var == single_linear_regression_data['independent_var']).all()
    assert (reg_model.dependent_var == single_linear_regression_data['dependent_var']).all()
    assert isinstance(reg_model.independent_var, np.ndarray)
    assert isinstance(reg_model.dependent_var, np.ndarray)


def test_single_linear_regression_fit(reg_model):
    """
    Test regression model coefficients.
    :return:
    """
    assert pytest.approx(reg_model.b1, 0.01) == 1.14
    assert pytest.approx(reg_model.b0, 0.01) == 0.43


def test_single_linear_regression_rmse(reg_model):
    """
    Test regression model root mean squared error.
    :return:
    """
    assert pytest.approx(reg_model.root_mean_squared_error(), 0.02) == 0.31


def test_single_linear_regression_r_squared(reg_model):
    """
    Test regression model r_squared.
    :return:
    """
    assert pytest.approx(reg_model.r_squared(), 0.01) == 0.52
```
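As one example of the edge-case testing mentioned above (wrong types, errors in data, null values), here is a sketch of a hypothetical `validate_inputs` helper – it is not part of our model, just an illustration of a check a future test could exercise with `pytest.raises`:

```python
import numpy as np


# Hypothetical helper (not in regression.py) that a future test
# suite could exercise: reject mismatched lengths and NaN values
# before attempting to fit a model.
def validate_inputs(independent_var: np.ndarray, dependent_var: np.ndarray) -> None:
    if len(independent_var) != len(dependent_var):
        raise ValueError("inputs must be the same length")
    if np.isnan(independent_var).any() or np.isnan(dependent_var).any():
        raise ValueError("inputs must not contain NaN values")


# In a test file this would be wrapped in pytest.raises(ValueError).
try:
    validate_inputs(np.array([1.0, 2.0]), np.array([1.0, np.nan]))
except ValueError as err:
    print(err)  # → inputs must not contain NaN values
```

Pushing validation like this into the model itself, and then testing it, is exactly the kind of addition the refactor in the next post makes easier.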

These tests can be run in a multitude of ways. My favorite is to utilize an IDE’s built-in functionality, which lets you run tests independently, only a handful at a time, or all at once. In general, PyCharm is my favorite tool for the job. After running them, you should see all tests passing, and the output will look something like this (results will vary depending on the options being passed to `pytest`):

There we have it, all tests passed in 0.03 seconds! We will continue to move forward and write tests as we move along to ensure functionality. This will be especially important when we refactor in the next post!