How to choose the best regression model for your ML application?

Mukul Keerthi
5 min read · Jun 6, 2020

With a variety of Machine Learning algorithms at your disposal, from regression models such as simple linear regression and polynomial regression to classification models such as logistic regression, you might have a hard time choosing the best fit for your particular ML application. This article discusses how to choose the ideal regression model based on the R-squared and adjusted R-squared intuition.

R-Squared Intuition

Consider a linear regression model fitted to a set of data points. The black line in the diagram below indicates the regression line. The dotted green lines indicate the residual distances between the data points and the regression line.

Linear regression model

Now, the sum of the squares of the residual distances from the data points to the regression line can be denoted by the following equation:

SSres = Σ (yᵢ − ŷᵢ)²

where yᵢ is the actual value of a data point and ŷᵢ is the value predicted by the regression line.

Let’s now draw a horizontal average line across the plot, at the mean of all the data points’ Y values.

The sum total of the squares of the distances from the data points to this average line can be written as:

SStot = Σ (yᵢ − ȳ)²

where ȳ is the mean of all the Y values.

The R-squared value we are discussing is then:

R² = 1 − SSres / SStot

The main thing to understand here is that the sum of squares of the distances between the data points and the regression line (SSres) should be minimal. In other words, the R-squared value should be as close to 1 as possible. The R-squared value indicates how well your regression line performs compared to the average line. In an ideal scenario, your regression line would pass perfectly through all of the data points, making the sum of squared residuals equal to 0 and the resulting R-squared value equal to 1. The R-squared value can even be negative, when the regression line fits the data worse than the average line does; however, under most circumstances the value lies between 0 and 1.
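The two sums of squares and the resulting R-squared value can be computed directly. Below is a minimal sketch using NumPy with made-up example data (the `x` and `y` values are hypothetical, not from the article):

```python
import numpy as np

# Toy data: y is roughly linear in x (hypothetical example values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple least-squares line y_hat = a*x + b
a, b = np.polyfit(x, y, 1)
y_hat = a * x + b

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares (SSres)
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares (SStot)
r_squared = 1 - ss_res / ss_tot

print(r_squared)
```

Because the toy data is almost perfectly linear, the residuals are tiny and R-squared comes out very close to 1.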

Adjusted R-Squared Intuition

Now, let’s consider a multiple linear regression equation, in which more than one variable impacts the outcome:

y = b₀ + b₁x₁ + b₂x₂ + … + bₚxₚ

The aim here is again to make the sum of squared residuals as close to 0 as possible, so that the R-squared value is high (close to 1). The question is whether adding an extra variable to our model makes the prediction more accurate or not. The catch is that when you add an extra variable, the R-squared value never decreases, so R-squared alone cannot tell you whether the new variable genuinely improves the model. We can add as many variables as we like, but R-squared will not reveal their true impact. We therefore need a different formula to decide whether the model is a good fit, and this is where the adjusted R-squared intuition comes in.

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)

Here ‘p’ indicates the number of regressors (independent variables) and ‘n’ indicates the sample size. The (n − 1) / (n − p − 1) term acts as a penalizing factor: every added regressor increases ‘p’ and hence the penalty. If adding a variable decreases the adjusted R-squared value, the variable does not improve the model; if it increases the adjusted R-squared value, it can be concluded that the enlarged model is a better fit for your ML application.
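The penalty is easy to see numerically. A small sketch of the adjusted R-squared formula, with hypothetical values for R², n, and p:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical example: the same raw R² of 0.9 on n = 50 samples
print(adjusted_r_squared(0.9, n=50, p=2))   # few regressors: small penalty
print(adjusted_r_squared(0.9, n=50, p=10))  # many regressors: larger penalty
```

With the same raw R-squared, the model with more regressors ends up with the lower adjusted value, which is exactly the behaviour that lets you compare models of different sizes.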

How to find the adjusted R-square value in Python?

Python, being one of the most common languages used in Machine Learning, has APIs that help you decide whether a particular model is a good fit. However, certain prerequisites need to be satisfied before you can draw a conclusion about a model. Firstly, the data should be in .csv format, with the features in the first columns and the dependent variable in the last column. Missing data needs to be handled and no categorical data should remain (i.e. the data pre-processing should already be performed).

To find metrics such as the R-squared value, we can use Scikit-learn’s regression metrics. The `r2_score` function returns the R-squared value of a particular model, from which the adjusted R-squared value can be computed using the adjusted R-squared formula.

To evaluate which model best suits the application, compute the adjusted R-squared value for each candidate model in Python and choose the model whose value is closest to 1.
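Putting the pieces together, a minimal end-to-end sketch with Scikit-learn might look like this. The synthetic data and the 0.5 noise level are assumptions for illustration; note that Scikit-learn's `r2_score` returns the plain R-squared, so the adjustment is applied manually:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 100 samples, 3 regressors, y depends on two of them
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))     # plain R-squared from Scikit-learn

# Apply the adjusted R-squared formula by hand
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)
```

Repeating this for each candidate model (e.g. with different feature subsets) and comparing the `adj_r2` values gives the model comparison described above.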

Pros and cons of various regression models

Each regression model has its own set of pros and cons that need to be considered before applying it to your ML application. A few of them are listed in the table below.

Conclusion

We now know how the R-squared and adjusted R-squared values indicate how well a regression model suits your ML application. We also discussed how Scikit-learn in Python can be used to find these values, compare various models, and decide whether a model is a good fit for your machine learning application.


The author works as an embedded software engineer in the lovely mid-west of Ireland!