The metric used to evaluate a regression model built on the 2017 developer survey data from Stack Overflow.
With my regression model up and running, I started thinking about its performance. For a classification model, a confusion matrix summarises performance; for a regression model, it's not that simple. This article aims to shed light on the metric used to evaluate a regression model and how it works.
r2_score or R-squared
Commonly known as R-squared, the coefficient of determination measures how well the independent features (X) explain the variability of the dependent feature (y).
The value of R-squared, usually computed with the sklearn.metrics package in Python, typically lies between 0 and 1: 0 means the features explain none of the variance of y, while 1, the rare case of perfectly explained variance, indicates an ideal model. (sklearn can even return negative values for models that fit worse than simply predicting the mean.)
For example, a value of 0.68 means that 68% of the variability of the target variable (y) is explained by the given independent features (X).
How is it calculated?
The one-liner sklearn.metrics.r2_score(y_true, y_pred) gives us the value of R-squared, but behind the scenes the function rests on three quantities.
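What r2_score returns can be reproduced by hand, since it computes 1 − SSE/SST. A minimal sketch with hypothetical salary values (illustrative only, not from the survey), using numpy instead of sklearn:

```python
import numpy as np

# hypothetical true and predicted salaries (illustrative values only)
y_true = np.array([45_000.0, 60_000.0, 52_000.0, 75_000.0])
y_pred = np.array([48_000.0, 58_000.0, 50_000.0, 71_000.0])

# r2_score(y_true, y_pred) is equivalent to 1 - SSE / SST
sse = np.sum((y_true - y_pred) ** 2)         # unexplained variation
sst = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1 - sse / sst  # ~0.934 for these values
```

Swapping these arrays for the true and predicted salaries from your own model gives the same number sklearn reports.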
RSS or SSE
Residual Sum of Squares (RSS) or Sum of Squares Error (SSE) measures the variation of the target variable (y) left unexplained by our regression line (y_hat). It is the sum of squared differences between the true and predicted values.
SSR
Sum of Squares due to Regression (SSR) measures how well our regression line fits the data. It is the sum of squared differences between the mean value and the predicted values.
SST
Sum of Squares Total (SST) measures the total variability of the data around the mean. It is the sum of squared differences between the mean value and the true values.
Can you see the relation?
As per the above descriptions, these three terms are related as:
SST = SSR + SSE
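The decomposition can be checked numerically. A minimal numpy sketch on toy data (illustrative values, not from the survey); note that SST = SSR + SSE holds exactly for least-squares fits with an intercept:

```python
import numpy as np

# toy data: y is roughly linear in X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# ordinary least-squares fit with an intercept
slope, intercept = np.polyfit(X, y, deg=1)
y_hat = slope * X + intercept

sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sst = np.sum((y - y.mean()) ** 2)      # total variation

# SST = SSR + SSE, so the two definitions of R-squared agree:
r2 = 1 - sse / sst  # equivalently ssr / sst
```

For this nearly linear data R-squared comes out very close to 1, as expected.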
As mentioned above, the higher the value of R-squared, the better our model. However, the value of R-squared never decreases when we add more independent features.
This means, introducing new independent features to improve r-squared makes the model better, right?
NO, because even a column like "whether a person is a left-handed or a right-handed batsman" could improve R-squared when predicting a developer's salary, without improving the model's actual performance.
Adjusted R-squared
Unlike R-squared, adjusted R-squared penalises the model for adding columns that have no significance in predicting the target variable.
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − m − 1)
where n is the number of rows and m is the number of independent features (columns)
The value of Adjusted R-squared :
- Increases, if r-squared shows a significant increase
- Decreases, if r-squared doesn’t show a significant increase
- We should consider more columns to make our predictive model better.
- Adding unnecessary columns will reduce the performance of the model.
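The penalty is easy to see by wrapping the adjusted R-squared formula, 1 − (1 − R²)(n − 1)/(n − m − 1), in a small helper (the function name adjusted_r2 is my own):

```python
def adjusted_r2(r2: float, n: int, m: int) -> float:
    """Adjusted R-squared for n rows and m independent features."""
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# the same R-squared costs more as useless columns pile up
base = adjusted_r2(0.68, n=100, m=5)      # ~0.663
bloated = adjusted_r2(0.68, n=100, m=30)  # ~0.541
```

With R-squared held fixed at 0.68, going from 5 to 30 columns drops the adjusted value noticeably: the extra columns must earn their keep by raising R-squared, or the adjusted score falls.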
So, the real question remains:
Is the model that I’ve made optimised?
To see the results and the approach I followed, go to my GitHub repo available here.