Why is polynomial regression linear?

1. The problem statement etc.:

Consider the polynomial regression equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \epsilon    [Eq. I]

where it is understood that y, x_1, x_2 and \epsilon are actually “vectors”, i.e., there is a separate column for each of them. A given dataset contains a large number of rows; each row gives some value of y, x_1, and x_2.
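Concretely (a sketch with made-up numbers; the variable names are mine, purely for illustration), the product and squared terms in Eq. I are just extra columns computed from the x_1 and x_2 columns:

```python
import numpy as np

# Hypothetical mini-dataset: three rows of (x1, x2). In a real dataset, y
# would also be given per row; here we only build the columns of Eq. I.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# One column per beta: 1, x1, x2, x1*x2, x1^2, x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
print(X[0])  # first row: [1, 1, 4, 4, 1, 16]
```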

Following the “standard” usage in statistics:

  • y is called the
    • “response” variable or
    • “outcome” variable;
  • \beta_i are called:
    • “parameters” or
    • “effects” or
    • “(estimated) regression coefficients”;
  • x_1, x_2, etc. are called:
    • “predictor” variables, or
    • “explanatory” variables, or
    • “controlled” variables, or
    • “factors”;
  • and \epsilon is called:
    • “errors”, or
    • “disturbance term”, or
    • “residuals” (or “residual errors”)
    • “noise”, etc.

I hate statisticians.

For simplicity, I will call y the dependent variable, the x‘s the independent variables, and \epsilon the errors. The \beta‘s will be called coefficients.

Consider an Excel spreadsheet having one column for y and possibly many columns for the x‘s. There are N rows.

Let’s use the index i for rows, and j for columns. Accordingly, y_i is the value of the dependent variable in the ith row; x_{ij} are the values of the independent variables in the same row; \beta_m‘s are the undetermined coefficients for the dataset as a whole; and \epsilon_i is the error of the ith row when some model is assumed.

Notice that here I refer to the coefficients \beta_m using an index other than i or j. This is deliberate. The index for the \beta‘s runs from 0 to m. In general, the number of coefficients can be greater than the number of independent variables, and it has no relation to how many rows there are in the dataset.

Obviously, a model as “general” as the above comes with products of the x_j terms or their squares (or, in a bigger model, with as many higher-order terms as you like).

Obviously, therefore, if you ask almost any engineer, he will say that Eq. I is an instance of a nonlinear regression.

However, statisticians disagree. They say that the particular regression model mentioned previously is linear.

They won’t clarify on their own, but if you probe them, they will supply the following as a simple example of a nonlinear model. (Typically, what they would supply would be a far more complicated model, but trust me, even the following simplest example satisfies their requirements for a nonlinear regression.)

y = \beta_0 + \beta_1^2 x + \epsilon            [Eq. II]

(And, yes, you could have raised \beta_0 to a higher order too.)

I hate statisticians. They mint at least M words for the same concept (with M \geq 5), each word very carefully chosen in such a way that it minimizes all your chances of understanding the exact nature of that concept. Naturally, I hate them.

Further, I also sometimes hate them because they don’t usually tell you, right in your first course, some broad but simple examples of nonlinear regression, right at the same time when they introduce you to the topic of linear regression.

Finally, I also hate them because they never give an adequate enough explanation as to why they call linear regression “linear”. Or, for that matter, why they just can’t get together and standardize their terminology.

Since I hate statisticians so much, but since the things they do also are mentioned in practical things like Data Science, I also try to understand why they do what they do.

In this post, I will jot down the reason behind their saying that Eq. I is a linear regression, but Eq. II is a nonlinear regression. I will touch upon several points of the context, but in a somewhat arbitrary order. (This is a very informally written, and very lengthy, post. It also comes with homework.)

2. What is the broadest purpose of regression analysis?:

The purpose of regression analysis is to give you a model—an equation—which you could use so as to predict y from a given tuple (x_1, x_2, \dots). The prediction should be as close to the true value as possible.

3. When do you use regression?

You use regression only when a given system is overdetermined. Mark this point well. Make sure to understand it.

A system of linear algebraic equations is said to be “consistent” (and exactly determined) when there are as many independent equations as there are unknown variables, so that a unique solution exists. For instance, a system like:
2 x + 7y = 10
9 x + 5 y = -12
is consistent.

There are two equations, and two unknowns x and y. You can use any direct method such as Cramer’s rule, Gaussian elimination, matrix factorization, or even matrix inversion (!), and get to know the unknown variables. When the number of unknowns, i.e., the number of equations, is large (FEM/CFD can easily produce systems of millions of equations in an equal number of unknowns), you can use iterative approaches like SOR etc.
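As a quick sketch (using NumPy; the numbers are the ones from the little system above), a consistent system is solved directly:

```python
import numpy as np

# The consistent 2x2 system from the text:
#   2x + 7y = 10
#   9x + 5y = -12
A = np.array([[2.0, 7.0],
              [9.0, 5.0]])
b = np.array([10.0, -12.0])

sol = np.linalg.solve(A, b)  # direct solution; unique, since det(A) != 0
print(np.allclose(A @ sol, b))  # substituting back reproduces b
```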

When the number of equations is smaller than the number of unknowns, the system is under-determined. You can’t solve such a system and get to a unique value by way of a solution.

When the number of equations is greater than the number of unknowns, the system is over-determined.

I am over-determined to see to it that I don’t explain everything about this topic of under- and over-determined systems to you. Go look it up on the ‘net. I will only mention that, in my knowledge (and I could be wrong, but it does seem that):

Regression becomes really relevant only for the over-determined systems.

You could use regression even on large consistent systems, but there, it would become indistinguishable from the iterative approaches to solutions. You could also use regression for the under-determined systems (and you can anyway use least squares for them). But I am not sure if you would want to use specifically regression here. In any case…

Data Science is full of problems where systems of equations are over-determined.

Every time you run into an Excel spreadsheet that has more rows than columns, you have an over-determined system.

That’s because such systems are not consistent—they have no exact, unique solution. That’s why regression comes in handy; that’s why it becomes important. If all systems were consistent, people would be happy using deterministic solutions, not regression.
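A sketch of such a situation (all numbers made up): five rows, but only two unknown coefficients. No choice of coefficients reproduces every row exactly, so least squares picks the pair that minimizes the total squared error:

```python
import numpy as np

# Over-determined: 5 equations (rows), 2 unknowns (intercept and slope).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # roughly y = 1 + 2x, with noise

A = np.column_stack([np.ones_like(x), x])  # design matrix: columns [1, x]
beta, residual, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # close to [1, 2]; no exact solution exists for all five rows
```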

4. How do you use a regression model?

4.1 The \hat{y} function as the placeholder:

In regression, we first propose (i.e. assume) a mathematical model, which can be broadly stated as:

y = \hat{y} + \epsilon,

where y are the row-wise given data values, and \hat{y} are their respective estimated values. Thus, \hat{y} is a function of the given dataset values x_j‘s; it stands for the value of y as estimated by the regression model. The error term gives you the difference between the actual and the predicted values—row-wise.

4.2 An assumed \hat{y} function:

But while y_i‘s and x_{ij}‘s are the given values, the function \hat{y} is what we first have to assume.

Accordingly, we may say, perhaps, that:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2                   ……….[Eq. 1]

or, alternatively, that:

\hat{y} = \beta_0 + \beta_1^2 x_1                   ……….[Eq. 2]

These two equations would give you two different estimates \hat{y} for the “true” (and unknown) y‘s, given the same dataset (the same spreadsheet).

Now, notice a few things about the form of the mathematical equation being assumed—for a regression model.

4.3 \beta_m‘s refer to the entire data-table taken as a whole—not row-wise:

In any given regression model, \beta_m coefficients do not change their values from one row to another row. They remain the same for all the rows of a given, concrete, dataset. It’s only the values of x_1, x_2 and y that change from row to row. The values of \beta_0, \beta_1, \dots remain the same for all the rows of the dataset taken together.

4.4 Applying a regressed model to new data works with one row at a time:

In using a regression model, we assume a model equation (Eq. 1 or Eq. 2), then plug in a new tuple of the known values of the independent variables (x_1, x_2, \dots) values into it, and voila! Out comes the \hat{y}—as predicted by that model. We treat \hat{y} as the best estimate for some true but unknown y for that specific combination of (x_1, x_2, \dots) values.
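For instance (with hypothetical, already-determined coefficient values; these numbers are mine, purely for illustration), “using” the model of Eq. 1 is a single function evaluation:

```python
# Hypothetical, already-known coefficients beta_0 .. beta_5 for Eq. 1:
beta = [2.0, 0.5, -1.0, 0.25, 0.1, -0.3]

def predict(x1, x2, b):
    """Evaluate Eq. 1 for one new row; the betas are constants here."""
    return (b[0] + b[1] * x1 + b[2] * x2
            + b[3] * x1 * x2 + b[4] * x1**2 + b[5] * x2**2)

y_hat = predict(2.0, 3.0, beta)  # one (x1, x2) tuple in, one prediction out
print(y_hat)
```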

Here, realize, for making predictions, just one row of the test dataset is enough—provided the values of \beta_m‘s are already known.

In contrast, to build a regression model, we need all the N number of rows of the training dataset, where N can be a very large number.

By the time you come to use a regression model, somehow, you have already determined the initially “undetermined” coefficients \beta_m‘s. They have already become known to you. That’s why, all that you need to do now is to just plug in some new values of x_j‘s, and get the prediction \hat{y} out.

This fact provides us with the first important clue:

The use of regression treats \beta_m coefficients as the known values that remain the same for all data-rows. That is, the use of regression treats \beta_m values as constants!

But how do you get to know what values to use for \beta_ms? How do you get to a regressed model in the first place?

5. How do you conduct regression?—i.e., how do you build the regression model?

In regression analysis, you first assume a model (say, Eq. 1 or Eq. 2 above). You then try to determine the values of the \beta_m‘s using some method like least squares, typically via gradient descent.

I will not go into why you use the method of least squares here. In briefest terms: for linear systems, the cost-function surface of the sum of the squared errors turns out to be convex (which is not true for the logistic regression model), and the least-squares method gives you the demonstrably “best” estimate for the coefficients. Check out at least this much, here [^]. If interested, check out the Gauss–Markov theorem it mentions, on the ‘net.

I also ask you to go and teach yourself the gradient descent method. The best way to do that is by “stealing” a simple-to-understand Python code. Now, chuck aside the Jupyter notebook, use an IDE instead, and debug-step through the code. Add at least two print statements for every Python statement. (That’s an order!)

Once done with it, pick up the mathematical literature. You will now find that you can go through the maths as easily as a hot knife through already-warm butter. [I will try to post a simple gradient descent code here in the next post, but no promises.]

So, that’s how you get to determine the (approximate but workable) values of \beta_m coefficients. You use the iterative technique of gradient descent, treating regression as a kind of an optimization problem. [Don’t worry if you don’t know the latter.]

“This is getting confusing. Tell me the real story” you say?

Ok. Vaguely speaking, the real story behind the gradient descent and the least squares goes something like in the next section.

6. How the GD works—a very high-level view:

To recap: Using a regression model treats the \beta_m coefficients as constants. Reaching (or building) a regression model from a dataset treats the \beta_m coefficients as variables. (This is regardless of whether you use gradient descent or any other method.)

In the gradient descent algorithm (GD), you choose a model equation like Eq. 1 or Eq. 2.

Then, you begin with some arbitrarily (but reasonably!) chosen \beta_m values as the initial guess.

Then, for each row of the dataset, you substitute the guessed \beta_m values into the chosen model equation. Since y, x_1, and x_2 are known for each row of the dataset, you can find the error for each row—as if these merely guessed \beta_m values were an actual solution for all of them.

Since the merely guessed values of \beta_m aren’t in general the same as the actual solution, their substitution on the right hand-side of the model (Eq. 1 or Eq. 2) doesn’t reproduce the left hand-side. Therefore, there are errors:

\epsilon = y - \hat{y}

Note, there is a separate error value for each row of the dataset.

You then want to know some single number which tells you how bad your initial guess of \beta_m values was, when this guess was applied to every row in the dataset as such. (Remember, the \beta_m‘s always remain the same for the entire dataset; they don’t change from row to row.) In other words, you want a measure of the “total error” for the dataset.

Since the row-wise errors can be both positive and negative, if you simply sum them up, they might partially or fully cancel each other out. We therefore need some positive-valued measure for the row-wise errors, before we add them all together. One simple (and, for many mathematical reasons, a very good) choice to turn both positive and negative row-wise errors into an always positive measure is to take the squares of the individual row-wise errors, and then add them up together, so as to get the sum of the squared errors, i.e., the total error for the entire dataset.

The dataset error measure typically adopted then is the average of the squares of the row-wise errors.
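This average of the squared row-wise errors (the usual mean squared error) is a one-liner; a sketch:

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared row-wise errors over all N rows."""
    err = y - y_hat         # one error per row; signs may differ
    return np.mean(err**2)  # squaring prevents cancellation; then average

y_given = np.array([1.0, 2.0, 3.0])
y_est   = np.array([1.5, 2.0, 2.0])
print(mse(y_given, y_est))  # (0.25 + 0.0 + 1.0) / 3
```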

In this way, you come to assume a cost function for the total dataset. A cost function is a function of the \beta_m values. If there are two \beta‘s in your model, say \beta_0 and \beta_1, then the cost function C would vary as either or both the coefficient values were varied. In general, it would form a cost function surface constructed in reference to a parameters-space.

You then calculate an appropriate correction to be applied to the initially guessed values of the \beta_m‘s, so as to improve your initial guess. This correction is typically taken as being proportional to the gradient of the cost function with respect to the \beta_m‘s, evaluated at the current guess.

In short, you take a guess for the \beta_m‘s, and find the current value of the total cost function C for the dataset (which is obtained by substituting the guessed \beta_m values into the assumed model equation). Then, you apply the correction to the \beta_m values, and thereby update them.

You then repeat this procedure, improving your guess during each iteration.
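The procedure just described can be sketched in a few lines. (This is my own minimal sketch, for the simplest model \hat{y} = \beta_0 + \beta_1 x; the data, learning rate, and iteration count are all made up.)

```python
import numpy as np

# Made-up data, roughly following y = 1 + 2x:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

b0, b1 = 0.0, 0.0  # initial guess for the betas
lr = 0.02          # learning rate: scales the per-iteration correction

for _ in range(5000):
    y_hat = b0 + b1 * x  # predictions for all rows at once
    err = y_hat - y      # row-wise errors
    # Gradient of the mean-squared-error cost w.r.t. b0 and b1:
    g0 = 2.0 * np.mean(err)
    g1 = 2.0 * np.mean(err * x)
    b0 -= lr * g0        # step against the gradient, i.e.,
    b1 -= lr * g1        # downhill on the cost surface

print(b0, b1)  # converges to the least-squares values (about 1.04 and 1.99)
```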

That’s what everyone says about the gradient descent. Now, something which they don’t point out, but which is important for our purposes:

Note that in any such process (using GD or any other iterative technique), you essentially are treating the \beta_m terms as variables, not as constants. In other words:

The process of regression—i.e., the process of model-building (as in contrast to using an already built model)—treats the unknown coefficients \beta_m‘s as variables, not constants.

The variable \beta_m values then iteratively converge towards some stable set of values which, within a tolerance, you can accept as being the final solution.

The values of \beta_m‘s so obtained are, thus, regarded as being essentially constant (within some specified tolerance band).

7. How statisticians look at the word “regression”:

Statisticians take the word “regression” primarily in the sense of the process of building a regression model.

Taken in this sense, the word “regression” does not refer to the process of using the final model. It is not a process of taking some value on the x-axis, locating the point on the model curve, and reading out the \hat{y} from it.

To statisticians,

Regression is, primarily, a process of regressing from a set of specific (y, x_1, x_2, \dots) tuples in a given dataset, to the corresponding estimated mean values \hat{y}‘s.

Thus, statisticians regard regression as the shifting, turning, and twisting of a tentatively picked up curve of \hat{y} vs. x_1, so that it comes to fit the totality of the dataset in some best possible sense. [With more than one parameter or coefficient, it’s a cost function surface.]

To repeat: The process of regressing from the randomly distributed and concretely given y values to the regressed mean values \hat{y}, while following an assumed functional form for the model, necessarily treats the \beta_m‘s as variables.

Thus, \hat{y}, and therefore C (the cost function used in GD) is a function not just of x_{ij}‘s but also of \beta_m‘s.

So, the regression process is not just:

\hat{y} = f( x_1, x_2, \dots).

It actually is:

\hat{y} = f( x_1, x_2, \dots; \beta_0, \beta_1, \beta_2, \dots).

Now, since the (x_1, x_2, \dots) values remain constants, it is the \beta_m‘s which truly become the king—it is they which truly determine the row-wise \hat{y}‘s, and hence, the total cost function C = C( x_1, x_2, \dots; \beta_0, \beta_1, \beta_2, \dots).

So, on to the most crucial observation:

Since \hat{y} is a function of the \beta_m‘s, and since the \beta_m‘s are regarded as changeable, the algorithm’s behaviour depends on what kind of a functional dependency \hat{y} has, in the assumed model, on the \beta_m coefficients.

In regression with Eq. 1, \hat{y} is a linear function of \beta_m‘s. Hence, the evolution of the \beta_m values during a run of the algorithm would be a linear evolution. Hence this regression—this process of updating \beta_m values—is linear in nature.

However, in Eq. 2, \hat{y} depends quadratically on \beta_1, and so the \beta_m‘s evolve nonlinearly. Hence, it is a nonlinear regression.


The end-result of a polynomial regression is a mathematical function which, in general, is nonlinear in the x_j‘s.

However, the process of regression—the evolution—itself is linear in the \beta_m‘s for Eq. 1.

Further, no regression algorithm or process ever changes any of the given (x_1, x_2, \dots) or y values.

Statisticians therefore say that the polynomial regression is a subclass of linear regression—even if it is a polynomial in x_j‘s.
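This is also exactly why, mechanically, a polynomial fit reduces to ordinary linear least squares on transformed columns. A sketch (all data made up, and noiseless, so the betas get recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, 50)
x2 = rng.uniform(-1.0, 1.0, 50)
# Made-up "true" betas generating y via Eq. I (no noise, for clarity):
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.5 * x1 * x2 + 4.0 * x1**2 - 1.5 * x2**2

# The columns below are nonlinear in x1 and x2, but each is a fixed, known
# number per row, so y is a LINEAR combination of the unknown betas:
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))  # recovers [1, 2, -3, 0.5, 4, -1.5]
```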

8. Homework:

Homework 1:

Argue back with the statisticians.

Tell them that in using Eq. 1 above, the regression process does evolve linearly, but each evolution occurs with respect to another x-axis, say an x'-axis, where x' has been scaled quadratically with respect to the original x. It’s only after this quadratic scaling (or mapping) that we can at all get a straight line in the mapped space—not in the original space.

Hence, argue that Eq. 1 must be regarded as something like “semi-linear” or “linear-quadratic” regression.

Just argue, and see what they say. Do it. Even if only for fun.

I anticipate that they will not give you a direct answer. They will keep it vague. The reason is this:

Statisticians are, basically, mathematicians. They would always love to base their most basic definitions not in epistemologically lower-level facts or abstractions, but in abstractions of as high a level as is possible for them. (Basically, they all are Platonists at heart—whether explicitly or implicitly.)

That’s why, when you argue with them in the preceding manner, they will simply spread a large, vague smile on their faces, and point out to you the irrefutable fact that the cost-function surface makes a direct reference only to the parameters-space (i.e., a space spanned by the variations in the \beta_m‘s). The smile would be vague because, even if they do see your point, they also know that by answering this way, they would succeed in having silenced you. Such an outcome, in their rule-book of the intellectuals, stands for victory—theirs. It makes them feel superior. That’s what they really are longing for.

It’s my statistical prediction that most statisticians would answer you thusly. (Also most any Indian “intellectual”. [Indians are highly casteist a people—statistically speaking. No, don’t go by my word for it; go ask any psephologist worth his salt. So, the word should be: caste-intellectuals. But I look for the “outliers” here, and so, I don’t use that term—scare-quotes are enough.])

In the case this most probable event occurs, just leave them alone, come back, and enjoy some songs. …

Later on, who knows, you might come to realize that even if Platonism diminishes their discipline and personae, what they have developed as a profession and given you despite their Platonism, had enough of good elements which you could use practically. (Not because Platonism gives good things, but because this field is science, ultimately.)

…Why, in a more charitable mood, you might even want to thank them for having done that—for having given you some useful tools, even if they never clarified all their crucial concepts well enough to you. They could not have—given their Platonism and/or mysticism. So, when you are in a sufficiently good mood, just thank them, and leave them alone. … “c’est la vie…”

Homework 2:

Write a brief (one page, 200–300 words) summary for this post. Include all the essentials. Then, if possible, also make a flash card or two out of it. For neat examples, check out Chris Albon’s cards [^].

A song I like:

(Hindi) “gaataa rahe meraa dil…”
Music: S. D. Burman (with R.D. and team’s assistance)
Singers: Kishore Kumar, Lata Mangeshkar
Lyrics: Shailendra

— First Published: 2019.11.29 10:52 IST.

PS: Typos to be corrected later today.