## Learn how to use these measures to evaluate the goodness of fit of Linear and certain Nonlinear regression models


Mar 6, 2021


One of *the* most used and therefore misused measures in Regression Analysis is R² (pronounced R-squared). It’s sometimes called by its long name, the *coefficient of determination*, and it’s frequently confused with the coefficient of correlation *r*. See, it’s getting baffling already!

The technical definition of R² is that it is the proportion of variance in the response variable **y** that your regression model is able to “explain” via the introduction of regression variables.

Clearly, that doesn’t do a whole lot to clear the air.

Hence we appeal to the familiar visual of a linear regression line superimposed on a cloud of *(y,x)* points:

The flat horizontal orange line represents the Mean Model. The Mean Model is the simplest model that you can build for your data. For every *x* value, the Mean Model predicts the same *y* value, and that value is the mean of your **y** vector. In this case, it happens to be 38.81 x 10000 New Taiwan Dollar/Ping, where one Ping is 3.3 meter².

We can do better than the Mean Model at explaining the variance in **y**. For that, we need to add one or more regression variables. We will start with one such variable: the age of the house. The red line in the plot represents the predictions of a Linear Regression Model when it’s fitted on the *(**y**, **X**)* data set where **y**=HOUSE PRICE and **X**=HOUSE AGE. As you can see, the Linear Model with one variable fits a little better than the Mean Model.

*R² lets you quantify just how much better the Linear model fits the data as compared to the Mean Model.*

Let’s zoom into a portion of the above graph:

In the above plot, *(y_i - y_mean)* is the error made by the Mean Model in predicting *y_i*. If you calculate this error for each value of **y** and then calculate the sum of the square of each error, you will get a quantity that is proportional to the variance in **y**. It is known as the Total Sum of Squares (TSS).

The Total Sum of Squares is proportional to the variance in your data. It is the variance that the Mean Model wasn’t able to explain.

Because *TSS/N* is the actual variance in **y**, the TSS is proportional to the total variance in your data.

Being the sum of squares, the TSS for any data set is always non-negative.
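To make the link between TSS and variance concrete, here is a minimal sketch using NumPy. The price values are made up for illustration and are not taken from the article's data set.

```python
import numpy as np

# Hypothetical house prices (illustrative values, not from the article's data set)
y = np.array([37.9, 42.2, 47.3, 54.8, 43.1, 32.1, 40.3, 46.7])

y_mean = y.mean()                      # the single prediction made by the Mean Model
mean_model_errors = y - y_mean         # (y_i - y_mean) for every observation
tss = np.sum(mean_model_errors ** 2)   # Total Sum of Squares

# TSS / N equals the (population) variance of y, so TSS is proportional to it
print(tss, tss / len(y), np.var(y))
```

Note that `np.var` divides by N by default, which is why it matches *TSS/N* here.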

The Mean Model is a very simple model. It contains only one parameter, which is the mean of the dependent variable **y**, and it is represented as follows:

The Mean Model is also sometimes known as the Null Model or the Intercept-only Model. But this interchangeability of definitions is appropriate only when the Null or Intercept-only model is fitted, i.e. trained on the data set. That’s the only situation in which the Intercept will become the unconditional mean of **y**.

As mentioned earlier, if you want to do better than the Mean Model in explaining the variance in **y**, you need to add one or more regression variables. Let’s look closely at how adding the regression variable HOUSE AGE has helped the Linear Model in reducing the prediction error:

In the above plot, *(y_i — y_pred_i) *is the error made by the linear regression model in predicting *y_i*. This quantity is known as the **residual error** or simply the **residual**.

In the above plot, the residual error is clearly less than the prediction error of the Mean Model. Such improvement is not guaranteed. In a sub-optimal or badly constructed Linear Model, the residual error could be more than the prediction error of the Mean Model.

If you calculate this residual error for each value of **y** and then calculate the sum of the square of each such residual error, you will get a quantity that measures the total prediction error of the Linear Model. It is known as the Residual Sum of Squares (RSS).

The Residual Sum of Squares captures the prediction error of your custom Regression Model.

Being the sum of squares, the RSS for a regression model is always non-negative.

Thus, *(Residual Sum of Squares)/(Total Sum of Squares)* is the fraction of the total variance in **y** that your regression model wasn’t able to explain.

Conversely:

*1 - (Residual Sum of Squares)/(Total Sum of Squares)* is the fraction of the variance in **y** that your regression model *was* able to explain.

We will now state the formula for *R²* in terms of RSS and TSS as follows:
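Written out with the two sums of squares defined above, the definition reads as follows. The symbols $\hat{y}_i$ for *y_pred_i* and $\bar{y}$ for *y_mean* are a notational choice made here.

```latex
R^2 = 1 - \frac{RSS}{TSS}
    = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
```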

Here is the Python code that produced the above plot:

And here is the link to the data set.

For Linear Regression Models that include an intercept and are fitted (i.e. trained) using the Ordinary Least Squares (OLS) Estimation technique, the range of R² is 0 to 1. Consider the following plot:

In the above plot, one can see that the residual error *(y_i — y_pred_i)*² is less than the total error *(y_i — y_mean)². *It can be shown that if you fit a Linear Regression Model to the above data by using the OLS technique, i.e. by minimizing the sum of squares of residual errors (RSS), the worst that you can do is the Mean Model. But the sum of squares of residual errors of the Mean Model is simply TSS, i.e. for the Mean Model, RSS = TSS.

Hence for OLS linear regression models, *RSS ≤ TSS*.

Since *R² =1 — RSS/TSS*, in the case of a perfect fit, RSS=0 and R² =1. In the worst case, RSS=TSS and R² = 0.

For Ordinary Least Squares Linear Regression Models, R² ranges from 0 to 1
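The bound can be checked numerically. The sketch below uses synthetic data (an assumption, not the article's data set) and NumPy's least-squares solver. It compares an OLS fit that includes an intercept against a hand-picked line that was not fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): price falls with age, plus noise
age = rng.uniform(0, 40, size=200)
price = 50.0 - 1.1 * age + rng.normal(0, 12, size=200)

def r_squared(y, y_pred):
    """R-squared computed as 1 - RSS/TSS."""
    rss = np.sum((y - y_pred) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

# OLS fit with an intercept: minimize ||X b - y||^2 over b = (intercept, slope)
X = np.column_stack([np.ones_like(age), age])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
r2_ols = r_squared(price, X @ beta)

# A hand-picked line that was NOT fitted by least squares (wrong sign on the slope)
r2_bad = r_squared(price, 20.0 + 1.0 * age)

print(f"OLS R-squared: {r2_ols:.3f}, hand-picked line R-squared: {r2_bad:.3f}")
```

The OLS value stays within 0 and 1. The hand-picked line has RSS greater than TSS, so its R² is negative.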

Many non-linear regression models do *not *use the Ordinary Least Squares Estimation technique to fit the model. Examples of such nonlinear models include:

- The **exponential, gamma and inverse-Gaussian** regression models used for continuously varying, strictly positive *y* in the range (0, ∞).
- **Binary choice models such as the Logit (a.k.a. Logistic) and Probit** and their variants such as Ordered Probit used for y = 0 or 1, and the general class of Binomial regression models.
- The **Poisson, Generalized Poisson and the Negative Binomial** regression models for discrete non-negative *y ϵ [0, 1, 2, …, ∞)*, i.e. models for counts based data sets.

The model fitting procedure of these nonlinear models is not based on progressively minimizing the sum of squares of residual errors (RSS) and therefore the optimally fitted model could have a residual sum of squares that is greater than total sum of squares. That means, R² for such models can be a negative quantity. As such, R² is not a useful goodness-of-fit measure for most nonlinear models.

R-squared is not a useful goodness-of-fit measure for most nonlinear regression models.

A notable exception is regression models that are fitted using the **Nonlinear Least Squares** (**NLS**) estimation technique. The NLS estimator seeks to minimize the sum of squares of residual errors, thereby making R² applicable to NLS regression models.

Later in this article, we’ll look at some alternatives to R-squared for nonlinear regression models.

Let’s look at the following figure again:

In the above plot, *(y_pred_i — y_mean) *is the reduction in prediction error that we achieved by adding a regression variable **HOUSE_AGE_YEARS **to our model.

If you calculate this difference for each value of **y** and then calculate the sum of the square of each difference, you will get a quantity that is proportional to the variance in **y** that the Linear Regression model *was* able to explain. It is known as the Explained Sum of Squares (ESS).

The Explained Sum of Squares is proportional to the variance in your data that your regression model was able to explain.

Let’s do some math.

From the above plot, one can see that:

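Squaring *(y_i - y_mean) = (y_i - y_pred_i) + (y_pred_i - y_mean)* and summing over all observations produces a cross term. It can be shown that when the Least Squares Estimation technique is used to fit a linear regression model with an intercept, the sum over all observations of this cross term, *2(y_i - y_pred_i)(y_pred_i - y_mean)*, is 0.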

So for the special case of OLS Regression Model:

In other words:
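Spelled out with the sums of squares defined earlier, the decomposition and its OLS special case can be written as below. This is a reconstruction from the surrounding definitions, with $\hat{y}_i$ standing for *y_pred_i* and $\bar{y}$ for *y_mean*.

```latex
\begin{aligned}
(y_i - \bar{y}) &= (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \\
\sum_i (y_i - \bar{y})^2 &= \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2
  + 2\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
\text{For OLS:}\quad TSS &= RSS + ESS \\
R^2 &= 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}
\end{aligned}
```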

The linear regression model that we have used to illustrate the concepts has been fitted on a curated version of the New Taipei City Real Estate data set. Let’s see how to build this linear model and find the R² score for it.

We’ll begin by importing all the required libraries:

`import pandas as pd`

`from matplotlib import pyplot as plt`

`from statsmodels.regression.linear_model import OLS`

`import statsmodels.api as sm`

Next, let’s read in the data file using Pandas. You can download the data set from here.

`df = pd.read_csv('taiwan_real_estate_valuation_curated.csv', header=0)`

Print the top 10 rows:

Our dependent **y** variable is **HOUSE_PRICE_PER_UNIT_AREA** and our explanatory a.k.a. regression a.k.a. **X** variable is **HOUSE_AGE_YEARS**.

We’ll carve out the **y** and **X** matrices:

`y = df['HOUSE_PRICE_PER_UNIT_AREA']`

`X = df['HOUSE_AGE_YEARS']`

Since houses of age zero years, i.e. new houses, will also have some non-zero price, we need to add a y intercept. This is the ‘**β0**’ in the equation of the straight line:

*y_pred = β1 \* X + β0*

`X = sm.add_constant(X)`

Next, we build and fit the OLS regression model and print the training summary:

`olsr_model = OLS(endog=y, exog=X)`

`olsr_results = olsr_model.fit()`

`print(olsr_results.summary())`

Here’s the output we get:

We see that the R² is 0.238. R² is not very large indicating a weak linear relationship between **HOUSE_PRICE_PER_UNIT_AREA **and** HOUSE_AGE_YEARS**.

The equation of the fitted model is as follows:

*HOUSE_PRICE_PER_UNIT_AREA_pred = -1.0950*HOUSE_AGE_YEARS + 50.6617.*

There is a somewhat weak and negative relationship between the age of the house and its price. And houses of zero age are predicted to have a *mean* price per unit area of 50.6617 x 10000 New Taiwan Dollar/Ping.

Asking how to increase R² is, in a way, like asking how to become rich or how to lose weight. As the saying goes, be careful what you ask for, because you just might get it!

The naive way to increase R² in an OLS linear regression model is to throw in more regression variables but this can also lead to an over-fitted model.

To see why adding regression variables to an OLS Regression model does not reduce R², consider two linear models fitted using the OLS technique:

*y_pred** = **β1*****X1** + **β0*

*y_pred** = **β2*****X2 **+ **β1*****X1** + **β0*

The OLS estimation technique minimizes the residual sum of squares (RSS). If the second model does not improve the value of R² over the first model, the OLS estimation technique will set ** β2 **to zero or to some value that is statistically insignificant which would essentially get us back to the first model. Generally speaking, each time you add a new regression variable and refit the model using OLS, you will either get a model with a better R² or essentially the same R² as the more constrained model.
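A quick numerical check of this property is sketched below. It uses synthetic data in which the second variable is pure noise by construction, and NumPy's least-squares solver as the OLS fitter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(size=n)
x2 = rng.normal(size=n)          # pure noise, unrelated to y by construction

def ols_r_squared(columns, y):
    """Fit OLS with an intercept via least squares and return R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

r2_one = ols_r_squared([x1], y)          # model with X1 only
r2_two = ols_r_squared([x1, x2], y)      # model with X1 and the noise variable X2
print(r2_one, r2_two)
```

Even though `x2` carries no information about `y`, the two-variable R² is never lower than the one-variable R². This is the over-fitting risk described above.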

This property of OLS estimation can work against you. If you go on adding more and more variables, the model will become increasingly unconstrained and the risk of over-fitting to your training data set will correspondingly increase.

On the other hand, the addition of correctly chosen variables will increase the goodness of fit of the model without increasing the risk of over-fitting to the training data.

This tussle between our desire to increase R² and the need to minimize over-fitting has led to the creation of another goodness-of-fit measure called the **Adjusted-R².**

The concept behind Adjusted-R² is simple. To get Adjusted-R², we penalize R² each time a new regression variable is added.

Specifically, we scale (1-R²) by a factor that grows with the number of regression variables. The greater the number of regression variables in the model, the greater this scaling factor and the greater the downward adjustment to R².

The formula for Adjusted-R² is:

*df_mean_model *is the degrees of freedom of the mean model. For a training data set of size N, *df_mean_model=(N-1)*.

*df_model *is the degrees of freedom of the regression model. For a model with *p *regression variables, *df_model=(N-1-p)*.

Substituting:
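In terms of the degrees of freedom defined above, the commonly used form of the adjustment and its substituted version can be written as follows. This is a reconstruction from the surrounding definitions, not a new result.

```latex
\begin{aligned}
R^2_{adj} &= 1 - \frac{RSS / df_{model}}{TSS / df_{mean\_model}}
           = 1 - (1 - R^2)\,\frac{df_{mean\_model}}{df_{model}} \\
          &= 1 - (1 - R^2)\,\frac{N - 1}{N - 1 - p}
\end{aligned}
```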

One can see that as the model acquires more variables, *p* increases and the factor *(N-1)/(N-1-p)* increases, which has the effect of depressing Adjusted-R².

## Drawbacks of Adjusted-R²

Adjusted-R² has some problems, notably:

- It treats the effect of all regression variables equally. In reality, some variables are more influential than others in their ability to make the model fit (or over-fit) the training data.
- The formula for Adjusted-R² yields negative values when R² falls below p/(N-1) thereby limiting the use of Adjusted-R² to only values of R² that are above p/(N-1).
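The second drawback is easy to see with a small helper. The function name, sample size and variable count below are hypothetical, chosen only to illustrate the *p/(N-1)* threshold.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared = 1 - (1 - R2) * (N - 1) / (N - 1 - p)."""
    if n - 1 - p <= 0:
        raise ValueError("need more observations than regression variables plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 1 - p)

n, p = 50, 5                      # hypothetical sample size and variable count
threshold = p / (n - 1)           # R-squared below this gives a negative adjusted value

print(adjusted_r_squared(0.05, n, p))        # R-squared below the threshold
print(adjusted_r_squared(threshold, n, p))   # R-squared exactly at the threshold
print(adjusted_r_squared(0.60, n, p))        # R-squared comfortably above it
```

The first call returns a negative number, the second returns zero up to floating point error, and the third returns a value slightly below the unadjusted 0.60.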

We will illustrate the process of using Adjusted-R² using our example data set. To do so, let’s introduce another regression variable **NUM_CONVENIENCE_STORES_IN_AREA **and refit our OLS regression model on the data set:

`y = df['HOUSE_PRICE_PER_UNIT_AREA']`

`X = df[['HOUSE_AGE_YEARS', 'NUM_CONVENIENCE_STORES_IN_AREA']]`

`X = sm.add_constant(X)`

`olsr_model = OLS(endog=y, exog=X)`

`olsr_results = olsr_model.fit()`

Let’s print the model training summary:

Notice that both the R² and the Adjusted-R² of the model with two regression variables are more than double those of the model with one variable:

On balance, the addition of the new regression variable has increased the goodness-of-fit. This conclusion is further supported by the p-values of the regression coefficients. We see from the regression output that the p-values of all three coefficients in the 2-variable OLSR model are essentially zero, indicating that all parameters are statistically significant:

The equation of the fitted two-variable model is as follows:

*HOUSE_PRICE_PER_UNIT_AREA_pred = -0.7709*HOUSE_AGE_YEARS + 2.6287*NUM_CONVENIENCE_STORES_IN_AREA + 36.9925*

Nonlinear models often use model fitting techniques such as **Maximum Likelihood Estimation (MLE)** which do not necessarily minimize the Residual Sum of Squares (RSS). Thus, given two nonlinear models that have been fitted using MLE, the one with the greater goodness-of-fit may turn out to have a lower R² or Adjusted-R². Another consequence of this fact is that adding regression variables to nonlinear models can reduce R². Overall, R² or Adjusted-R² should not be used for judging the goodness-of-fit of nonlinear regression models.

For nonlinear models, there have been a range of alternatives proposed for the humble R². We’ll look at one such alternative that is based on the following identity that we have come to know so well:

*Total Sum of Squares (TSS) = Residual Sum of Squares (RSS) + Explained Sum of Squares (ESS).*

While this identity works for OLS Linear Regression Models a.k.a. Linear Models, for nonlinear regression models, it turns out that a similar kind of *triangle* identity works using the concept of **Deviance**. We’ll explain the concept of Deviance in a bit but for now, let’s look at this identity for nonlinear regression models:

*Deviance of the Intercept-only model = Deviance of the fitted nonlinear model + Deviance explained by the fitted nonlinear model.*

Notation-wise:
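In symbols, the identity stated in words above can be written as follows. The *D(·, ·)* notation is the one used in the definitions that accompany it.

```latex
D(\mathbf{y}, \mathbf{y}_{mean}) = D(\mathbf{y}, \mathbf{y}_{pred}) + D(\mathbf{y}_{pred}, \mathbf{y}_{mean})
```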

Where:

*D(**y**, **y_mean**) = Deviance of the Intercept-only model*

*D(**y**, **y_pred**) = Deviance of the fitted nonlinear model*

*D(**y_pred**, **y_mean**) = Deviance explained by the fitted nonlinear model*

Using the above identity, Cameron and Windmeijer have described (see paper links at the end of article) the following Deviance based formula for R² that is applicable to nonlinear models, especially for Generalized Linear Regression Models (known as GLMs) that are fitted on discrete data. Commonly occurring examples of such nonlinear models are the Poisson and Generalized Poisson models, the Negative Binomial Regression model and the Logistic Regression Model:

Before we go any further, some terms deserve an explanation.

## Deviance

**Deviance** of a regression model measures by how much the **Log-Likelihood** (more about Log-Likelihood in a bit) of the fitted regression model falls short of the Log-Likelihood of the **saturated model**. Specifically,
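In its usual textbook form, deviance is twice the gap between the two log-likelihoods. The subscripted *LL* symbols are a notational assumption made here. Since the saturated model fits at least as well as any other model, this quantity is never negative.

```latex
D = 2\left[\, LL_{saturated} - LL_{fitted} \,\right]
```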

So this raises two questions: What is a **saturated model** and what is **Likelihood**?

## Saturated Model

A **saturated regression model** is one in which the number of regression variables is equal to the number of *unique y values* in the sample data set. What a saturated model gives you is essentially N equations in N variables, and we know from college algebra that a system of N equations in N variables yields an exact solution for each variable. Thus, a saturated model can be built to perfectly fit each **y** value. A saturated model thereby yields the maximum possible fit on your training data set.

## Likelihood

Now let’s tackle **Likelihood**. The Likelihood of a fitted regression model is the probability (or probability density) of jointly observing all **y** values in the training data set using the predictions of the fitted model as the mean parameter of the probability distribution function. The procedure for calculating Likelihood is as follows:

- Say your training data set contains 100 *y* observations. What you want to calculate is the **joint** probability of observing *y1* and *y2* and *y3* and … up to *y100* with your fitted regression model.
- So you start with fitting your regression model on this training data.
- Next, you feed the 100 rows in your training set through the fitted model to get 100 predictions from this model. These 100 *y_pred_i* values are your 100 conditional means (the *ith* mean is conditioned upon the corresponding *ith* row in your *X* matrix).
- Now you set the 100 observed **y** values and the 100 conditional means (the predictions) in the probability (density) function of *y* to get 100 probability values. One probability for each *y_i* in *y*.
- Finally, you multiply together these 100 probabilities to get the Likelihood value. This is the likelihood of observing your training data set given your fitted model.
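The steps above can be sketched in a few lines of Python. The example assumes a Poisson distribution for *y*, uses five made-up observations instead of 100, and takes the conditional means as given rather than fitting a model. Multiplying the probabilities also assumes that the observations are independent.

```python
import math

# Hypothetical observed counts and the fitted model's conditional means
# (made-up numbers; in practice the means come from your fitted model's predictions)
y_observed = [2, 0, 3, 1, 4]
y_pred = [1.8, 0.6, 2.5, 1.2, 3.9]

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

# One probability per observation, each using its own conditional mean
probabilities = [poisson_pmf(k, mu) for k, mu in zip(y_observed, y_pred)]

likelihood = math.prod(probabilities)                      # joint probability
log_likelihood = sum(math.log(p) for p in probabilities)   # natural log of the above

print(likelihood, log_likelihood)
```

With real data sets the product of many small probabilities underflows quickly, which is one practical reason to work with the sum of logs instead.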

To get a feel for the calculation, I’d encourage you to refer to the following article. It contains a sample calculation for Likelihood for a Poisson Model:

The **Log-Likelihood** is simply the natural logarithm of the Likelihood of the fitted model.

With these concepts under our belt, let’s circle back to our Deviance based formula for Pseudo-R²:

As mentioned earlier:

*D(**y**, **y_pred**) = Deviance of the fitted nonlinear model*

*D(**y**, **y_mean**) = Deviance of the Intercept-only model *(a.k.a. the null-model). The null model contains only the intercept i.e. no regression variables.

Using our formula for Deviance:

And,

Therefore:
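Putting the pieces together, the chain of substitutions can be written out as below. This is a reconstruction based on the definition of Deviance given earlier. The subscripts on *LL* are a notational assumption, with *null* denoting the intercept-only model.

```latex
\begin{aligned}
D(\mathbf{y}, \mathbf{y}_{pred}) &= 2\left[\, LL_{saturated} - LL_{fitted} \,\right] \\
D(\mathbf{y}, \mathbf{y}_{mean}) &= 2\left[\, LL_{saturated} - LL_{null} \,\right] \\
R^2_{pseudo} &= 1 - \frac{D(\mathbf{y}, \mathbf{y}_{pred})}{D(\mathbf{y}, \mathbf{y}_{mean})}
             = 1 - \frac{LL_{saturated} - LL_{fitted}}{LL_{saturated} - LL_{null}}
\end{aligned}
```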

Sometimes, the following **simpler version of Pseudo- R²** proposed by McFadden is used (see paper link below for details):
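In the same notation (again an assumption made here), McFadden's measure drops the saturated model and compares the fitted model directly with the intercept-only model:

```latex
R^2_{McFadden} = 1 - \frac{LL_{fitted}}{LL_{null}}
```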

McFadden’s Pseudo-R² is implemented by the Python statsmodels library for discrete data models such as Poisson or NegativeBinomial or the Logistic (Logit) regression model. If you read the `prsquared` attribute of the fitted results object (`DiscreteResults.prsquared`, accessed without parentheses), you will get the value of McFadden’s R-squared on your fitted nonlinear regression model.
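To see how the two pseudo-R² measures relate, here is a self-contained sketch for a Poisson model with a log link. The data are synthetic, and the small Newton-iteration fitter is a stand-in written for this illustration, not the routine statsmodels uses.

```python
import math
import numpy as np

rng = np.random.default_rng(7)

# Synthetic count data (an assumption for illustration): counts rise with x
n = 500
x = rng.uniform(0, 2, size=n)
y = rng.poisson(np.exp(0.3 + 0.8 * x))

def poisson_log_likelihood(y, mu):
    """Sum of log P(Y_i = y_i) under Poisson distributions with means mu."""
    log_factorials = np.array([math.lgamma(k + 1) for k in y])
    return float(np.sum(y * np.log(mu) - mu - log_factorials))

def fit_poisson(X, y, iterations=25):
    """Maximum likelihood fit of a log-link Poisson model by Newton's method."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start from the intercept-only solution
    for _ in range(iterations):
        mu = np.exp(X @ beta)
        gradient = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

X = np.column_stack([np.ones(n), x])    # first column is the intercept
mu_fitted = np.exp(X @ fit_poisson(X, y))
mu_null = np.full(n, y.mean())          # intercept-only model predicts the mean

ll_fitted = poisson_log_likelihood(y, mu_fitted)
ll_null = poisson_log_likelihood(y, mu_null)

# Saturated model: each mean equals its own observation (0 * log(0) taken as 0)
safe_y = np.where(y > 0, y, 1)
ll_saturated = float(np.sum(y * np.log(safe_y) - y
                            - np.array([math.lgamma(k + 1) for k in y])))

r2_deviance = 1 - (ll_saturated - ll_fitted) / (ll_saturated - ll_null)
r2_mcfadden = 1 - ll_fitted / ll_null
print(r2_deviance, r2_mcfadden)
```

The two numbers generally differ because the deviance-based version measures the improvement relative to the best achievable fit, while McFadden's version measures it relative to the null log-likelihood alone.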

See my tutorials on Poisson and NegativeBinomial Regression models on how to fit such types of nonlinear models on discrete (counts) based data sets:


**Dataset**

Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260–271.

## Paper and Book Links

Cameron A. Colin, Trivedi Pravin K., *Regression Analysis of Count Data*, Econometric Society Monograph No. 30, Cambridge University Press, 1998. ISBN: 0521635675

McCullagh P., Nelder John A., *Generalized Linear Models*, 2nd Ed., CRC Press, 1989, ISBN 0412317605, 9780412317606

McFadden, D. (1974), *Conditional logit analysis of qualitative choice behaviour*, in: P. Zarembka (ed.), Frontiers in Econometrics, Academic Press, New York, 105–142.

Cameron, A., & Frank A. G. Windmeijer. (1996). *R-Squared Measures for Count Data Regression Models with Applications to Health-Care Utilization*. Journal of Business & Economic Statistics, *14*(2), 209–220. doi:10.2307/1392433

A. Colin Cameron, Frank A.G. Windmeijer, *An R-squared measure of goodness of fit for some common nonlinear regression models*, Journal of Econometrics, Volume 77, Issue 2, 1997, Pages 329–342, ISSN 0304–4076, https://doi.org/10.1016/S0304-4076(96)01818-0.

N. J. D. Nagelkerke, *A note on a general definition of the coefficient of determination*, Biometrika, Volume 78, Issue 3, September 1991, Pages 691–692, https://doi.org/10.1093/biomet/78.3.691.

## Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
