Single Linear Regression – Part 1 – Python ML – OOP Basics

Data scientists who come to the career without a software background (myself included) tend to use a procedural style of programming rather than taking an object-oriented approach. Changing styles is a paradigm shift and takes some time to wrap your mind around. Many of us who have been doing this for years still have trouble envisioning how objects can improve things. There are a lot of resources out there to help you understand this subject in more detail, but I am going to take a “learn by doing” approach. The code used for this post can be found on my GitHub.

Goal of this post:

  1. Build a very basic object to house our linear regression model
  2. Create a command line interface (CLI) to pass in different datasets
  3. Print the object to the screen in a user-friendly format

What we are leaving for the next post:

  1. Fitting a model to find coefficients
  2. Finding the RMSE, R^2, slope and intercept of the model
  3. Testing our model using pytest

Here we go!

Definition: OOP = Object Oriented Programming

Here is what we need to house in our object:

  • Data – Obviously… **note** in this case, the data needs to be in a specific format
  • Fit – Utilize the y = mx + b format that we all grew up with. We’ll write the code for this in the next post
  • Fit Results – After fitting the model, we typically need to be able to see the fit rather than just predicting results
  • Predictions – Make the model useful by being able to predict values provided by the user
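
As a quick illustration of the y = mx + b form mentioned in the list above, here is a minimal sketch that evaluates a line at a given x. The function name predict_line and the slope and intercept values are made up for illustration; the actual fitting code is left for the next post.

```python
def predict_line(m: float, b: float, x: float) -> float:
    """Evaluate y = m*x + b for a single x value."""
    return m * x + b


# Hypothetical slope and intercept, chosen only for illustration
y = predict_line(m=2.0, b=1.0, x=2.5)
print(y)  # 6.0
```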

What do we need as inputs?

  1. Independent variable values (typically ‘x’)
  2. Dependent variable values (typically ‘y’)
  3. Numeric value at which we want a prediction (similar to ‘x’)

We start by making a class and then define what it takes as input within the __init__ method. In our case, we are asking for two lists, independent_var and dependent_var, and a single numeric value, predict.

class SingleLinearRegression:

    def __init__(self, independent_var: list, dependent_var: list, predict: float):

        """
        Completes either a single or multiple linear regression. We will pass a single value to predict.
        :param independent_var: list
        :param dependent_var: list
        :param predict: float
        """
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.predict = predict
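
To see what __init__ does, here is a minimal sketch that creates an instance and reads the attributes back. The class is repeated in condensed form so the snippet runs on its own, and the values are made up for illustration.

```python
class SingleLinearRegression:
    def __init__(self, independent_var: list, dependent_var: list, predict: float):
        # Store the inputs on the instance so other methods can use them later
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.predict = predict


# Illustrative values only
model = SingleLinearRegression(
    independent_var=[1, 2, 3],
    dependent_var=[5, 6, 8],
    predict=2.5,
)

print(model.independent_var)  # [1, 2, 3]
print(model.predict)          # 2.5
```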

Next, we know that we will be fitting a model and predicting results. This will utilize fit and predictions methods. We will hold off on adding the math until the next post. Finally, we add the __str__ method, which is called when you print(your_object), in order to make the output legible. You will find that there is another method called __repr__ available, but it is typically used for a developer-facing representation of the object, such as when debugging. We will save this class by itself in a file called linear_regression.py.

class SingleLinearRegression:

    def __init__(self, independent_var: list, dependent_var: list, predict: float):

        """
        Completes either a single or multiple linear regression. We will pass a single value to predict.
        :param independent_var: list
        :param dependent_var: list
        :param predict: float
        """
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.predict = predict

    def fit(self) -> dict:
        pass

    def predictions(self) -> dict:
        pass

    def __str__(self):
        return f"""
            This class returns a dictionary of results from your on your linear regression:
            {{
                'independent_var': {self.independent_var},
                'dependent_var': {self.dependent_var},
                'fit': {{
                    'coefficient': coefficient,
                    'constant': constant,
                    'r_squared': r_squared,
                    'p_values': 'p_values'
                    }}, 
                'predictions': {{
                    'predict': {self.predict},
                    'result': result_of_predictions.
                    }}
            }}
            :return: dict
            """
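
The difference between __str__ and __repr__ mentioned above can be sketched with a small, hypothetical class (Example is not part of the project code). print() and str() use __str__, while repr() and the interactive prompt use __repr__, which is conventionally aimed at developers.

```python
class Example:
    def __init__(self, value: float):
        self.value = value

    def __str__(self) -> str:
        # Readable text for end users; used by print() and str()
        return f"Example with value {self.value}"

    def __repr__(self) -> str:
        # Unambiguous text for developers; used by repr() and the REPL
        return f"Example(value={self.value!r})"


example = Example(2.5)
print(example)        # Example with value 2.5
print(repr(example))  # Example(value=2.5)
```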

There we have it, our first class. By itself, this doesn’t do a whole lot for us. We have to create an instance of our class with all of our inputs. Before we go too far, let’s take a look at our folder structure.

We have a data directory with 2 csv files to use as “data”. We also have a linear_regression.py file, which holds the SingleLinearRegression class that we just created, and a run_me.py file, which will be used to run everything. You will also notice the requirements.txt file, which lists all of the required packages.

What should our run_me.py contain? It needs to import our SingleLinearRegression class, take in data and print out results. my_function() below shows that we will need to provide a dataset (filename and location) and the predict value. Note that the code for reading in the csv data is quite long; we will trim this down in the next post. We instantiate our object with the independent_data and dependent_data that were read from the dataset.

import csv
import click
from linear_regression import SingleLinearRegression


def my_function(dataset: str, predict: float):
    print('Starting run_me.py')

    # Read in csv data
    independent_data = []
    dependent_data = []
    with open(dataset, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Removes header row
        for row in reader:
            independent_data.append(row[0])
            dependent_data.append(row[1])

    # Create instance of SingleLinearRegression model
    single_linear_regression = SingleLinearRegression(
        independent_var=independent_data,
        dependent_var=dependent_data,
        predict=predict
    )

    print(single_linear_regression)
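
One thing to keep in mind: csv.reader yields every field as a string, so the lists above hold values like '1' rather than 1. A minimal sketch of converting to floats while reading follows. The helper name read_xy is hypothetical, and an in-memory file stands in for a real dataset so the snippet runs on its own.

```python
import csv
import io


def read_xy(csvfile) -> tuple:
    """Read a two-column file with a header row and return (x, y) as lists of floats."""
    reader = csv.reader(csvfile)
    next(reader, None)  # Skip the header row
    independent_data = []
    dependent_data = []
    for row in reader:
        independent_data.append(float(row[0]))
        dependent_data.append(float(row[1]))
    return independent_data, dependent_data


# In-memory stand-in for a csv file (column names are made up)
sample = io.StringIO("x,y\n1,5\n2,6\n3,8\n")
x, y = read_xy(sample)
print(x)  # [1.0, 2.0, 3.0]
print(y)  # [5.0, 6.0, 8.0]
```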

We aren’t quite done. As written, running run_me.py will not do anything, because my_function() is defined but never called. We need to set this up to take in an arbitrary dataset and run. This is where the click library comes in handy. There are a lot of different ways to pass arguments in from the CLI, but I prefer click for its simplicity.

Each @click.option should be self-explanatory: you provide the dataset location and the predict value, and the rest is handled in the program. We have also set default values for each. The if __name__ == '__main__' check that calls main() is pretty typical in Python and you will see it all over the place; it’s a good way to set up your projects.

import csv
import click
from linear_regression import SingleLinearRegression


@click.command()
@click.option('-d', '--dataset', default='./data/fake_data.csv',
              help='Dataset with independent variable in first column and dependent variable in second. \
              Dataset has a header row.')
@click.option('-p', '--predict', default=2.5,
              help='Independent variable value at which you would like the fit to predict.')
def main(dataset: str, predict: float):
    print('Starting run_me.py')

    # Read in csv data
    independent_data = []
    dependent_data = []
    with open(dataset, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Removes header row
        for row in reader:
            independent_data.append(row[0])
            dependent_data.append(row[1])

    # Create instance of SingleLinearRegression model
    single_linear_regression = SingleLinearRegression(
        independent_var=independent_data,
        dependent_var=dependent_data,
        predict=predict
    )

    print(single_linear_regression)


if __name__ == '__main__':
    main()
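
For comparison, here is a rough sketch of the same two options using argparse from the standard library, one of the other ways to pass arguments in from the CLI. This is not the approach used in this post; the function name parse_args is made up, and the defaults simply mirror the click options above.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description='Run a single linear regression.')
    parser.add_argument('-d', '--dataset', default='./data/fake_data.csv',
                        help='Dataset with a header row, independent variable in the '
                             'first column and dependent variable in the second.')
    parser.add_argument('-p', '--predict', type=float, default=2.5,
                        help='Value at which you would like a prediction.')
    # argv=None makes argparse read from sys.argv; a list can be passed for testing
    return parser.parse_args(argv)


args = parse_args(['-d', 'data/fake_data2.csv', '-p', '312'])
print(args.dataset)  # data/fake_data2.csv
print(args.predict)  # 312.0
```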

Finally, we can run run_me.py! Since both options have default values (the dataset defaults to fake_data.csv), we can simply run:

> python run_me.py

Terminal Output:

Starting run_me.py

            This class returns a dictionary of results from your on your linear regression:
            {
                'independent_var': ['1', '2', '3'],
                'dependent_var': ['5', '6', '8'],
                'fit': {
                    'coefficient': coefficient,
                    'constant': constant,
                    'r_squared': r_squared,
                    'p_values': 'p_values'
                    }, 
                'predictions': {
                    'predict': 2.5,
                    'result': result_of_predictions.
                    }
            }
            :return: dict

We can see that we have a nice description of our output, including dynamically populated values for independent_var, dependent_var, and predict. If we want to pass in a different dataset or predict value, it is simple:

> python run_me.py -d data/fake_data2.csv -p 312

Terminal Output:

Starting run_me.py

            This class returns a dictionary of results from your on your linear regression:
            {
                'independent_var': ['100', '200', '300'],
                'dependent_var': ['500', '600', '800'],
                'fit': {
                    'coefficient': coefficient,
                    'constant': constant,
                    'r_squared': r_squared,
                    'p_values': 'p_values'
                    }, 
                'predictions': {
                    'predict': 312.0,
                    'result': result_of_predictions.
                    }
            }
            :return: dict

You’ll notice that the variables have changed in the output! In the next post we will dive into making something a bit more useful.

I need to state this explicitly: I am not an expert in object-oriented design. These types of patterns are very specific, and experts in the field have been doing this for many years with a lot of mentoring. If you are taking anything into a production environment that people depend on, please take the time to have someone with lots of experience look at your code to help you gain confidence and grow your skills.