Equations using the matrix notation for a simple artificial neural network

Update on 2020.01.30 16:58 IST:

I have taken the document offline. Yes, it was too incomplete, tentative, and in fact also had errors. The biggest and most obvious error was about the error vector. 🙂

No, no one pointed out any of the errors or flaws to me.

Yes, I will post an expanded and revised version later, hopefully in the first week of February. (I started work on this document on 24th Jan., but was also looking into other issues.) When I am done, I will delete this entire post and make a new entry to announce the availability of the corrected, expanded, and revised document.

The original post appears below.


Go, read [^].

Let me know about the typos. Also, errors. [Though I don’t expect you to do that. [I will eat my estimate of your moral character, on this count, at least for the time being.]]

I can, and might, take it out of the published domain, any time I want. [Yes, I am irresponsible, careless, unreliable, etc. [Also, “imposing” type. Without “team-spirit”. One who looks down on his colleagues as being way, way, inferior to me.]]

I will also improve on it—I mean the document. In fact, I even intend to expand it, with some brief notes to be added on various activation- and loss-functions.

Eventually, I may even publish it at GitHub. Or, at arXiv. [If they let me do that. But then, another consideration: there are physicists there!]

 


A song I like:

(Hindi) “कभी तो मिलेगी” (kabhee to milegi)
Singer: Lata
Music: Roshan
Lyrics: Majrooh Sultanpuri

[Credits happily listed in a mostly random order. [The issue was only with the last two; it was clear which one had to appear first.]]

 

 

Why is polynomial regression linear?

1. The problem statement etc.:

Consider the polynomial regression equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \epsilon    [Eq. I]

where it is understood that y, x_1, x_2 and \epsilon actually are “vectors”, i.e., there is a separate column for each of them. A given dataset contains a large number of rows; each row has some value of y, x_1, and x_2 given in it.

Following the “standard” usage in statistics:

  • y is called the
    • “response” variable or
    • “outcome” variable;
  • \beta_i are called:
    • “parameters” or
    • “effects” or
    • “(estimated) regression coefficients”;
  • x_1, x_2, etc. are called:
    • “predictor” variables, or
    • “explanatory” variables, or
    • “controlled” variables, or
    • also “factors”;
  • and \epsilon is called:
    • “errors”, or
    • “disturbance term”, or
    • “residuals” (or “residual errors”), or
    • “noise”, etc.

I hate statisticians.

For simplicity, I will call y the dependent variable, the x‘s the independent variables, and \epsilon the errors. The \beta‘s will be called the coefficients.

Consider an Excel spreadsheet having one column for y and possibly many columns for the x‘s. There are N rows.

Let’s use the index i for rows, and j for columns. Accordingly, y_i is the value of the dependent variable in the ith row; x_{ij} are the values of the independent variables in the same row; \beta_m‘s are the undetermined coefficients for the dataset as a whole; and \epsilon_i is the error of the ith row when some model is assumed.

Notice that here I refer to the coefficients \beta_m using an index other than i or j. This is deliberate. The index for the \beta‘s runs from 0 to m. The number of coefficients can be greater than the number of independent variables, and in general, it has no relation to how many rows there are in the dataset.
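Written out row-wise with this indexing, Eq. I reads:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \beta_4 x_{i1}^2 + \beta_5 x_{i2}^2 + \epsilon_i, \qquad i = 1, 2, \dots, N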

Obviously, a model which is as “general” as the one above comes with products of the x_j terms or their squares (or, in a bigger model, with as many higher-order terms as you like).

Obviously, therefore, if you ask most any engineer, he will say that Eq. I is an instance of a non-linear regression.

However, statisticians disagree. They say that the particular regression model mentioned previously is linear.

They won’t clarify on their own, but if you probe them, they will supply the following as a simple example of a nonlinear model. (Typically, what they would supply would be a far more complicated model, but trust me, the following simplest example too satisfies their requirements for a nonlinear regression.)

y = \beta_0 + \beta_1^2 x + \epsilon            [Eq II].

(And, yes, you could have raised \beta_0 to a higher order too.)

I hate statisticians. They mint at least M words for the same concept (with M \geq 5), with each word very carefully chosen in such a way that it minimizes all your chances of understanding the exact nature of that concept. Naturally, I hate them.

Further, I also sometimes hate them because they don’t usually give you, right in your first course, some broad but simple examples of nonlinear regression at the same time as they introduce you to the topic of linear regression.

Finally, I also hate them because they never give an adequate explanation as to why they call linear regression “linear”. Or, for that matter, why they just can’t get together and standardize their terminology.

Since I hate statisticians so much, but since the things they do also are mentioned in practical things like Data Science, I also try to understand why they do what they do.

In this post, I will jot down the reason behind their saying that Eq. I is a linear regression, but Eq. II is a nonlinear regression. I will touch upon several points of the context, but in a somewhat arbitrary order. (This is a very informally written, and very lengthy, post. It also comes with homework.)


2. What is the broadest purpose of regression analysis?:

The purpose of regression analysis is to give you a model (an equation) which you could use to predict y from a given tuple (x_1, x_2, \dots). The prediction should be as close to the true y as possible.


3. When do you use regression?

You use regression only when a given system is overdetermined. Mark this point well. Make sure to understand it.

A system of linear algebraic equations has a unique solution when there are exactly as many independent equations as there are unknown variables; such a system is consistent and exactly determined. For instance, a system like:
2 x + 7y = 10
9 x + 5 y = -12
is consistent.

There are two equations, and two unknowns x and y. You can use any direct method such as Cramer’s rule, Gaussian elimination, matrix factorization, or even matrix inversion (!), and get to know the unknown variables. When the number of unknowns, i.e., the number of equations, is large (FEM/CFD can easily produce systems of millions of equations in an equal number of unknowns), you can use iterative approaches like SOR etc.
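If you want to see this concretely, here is a minimal numpy sketch (my own illustration, not from any dataset) that solves the consistent 2 x 2 system given above directly:

import numpy as np

# The consistent system:
#   2x + 7y =  10
#   9x + 5y = -12
A = np.array( [ [ 2.0, 7.0 ],
                [ 9.0, 5.0 ] ] )
b = np.array( [ 10.0, -12.0 ] )

# A direct method (internally, an LU factorization)
solution = np.linalg.solve( A, b )
print( "x = %lf, y = %lf" % ( solution[ 0 ], solution[ 1 ] ) )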

When the number of equations is smaller than the number of unknowns, the system is under-determined. You can’t solve such a system and get to a unique solution.

When the number of equations is greater than the number of unknowns, the system is over-determined.

I am over-determined to see to it that I don’t explain everything about this topic of under- and over-determined systems to you. Go look it up on the ‘net. I will only mention that, in my knowledge (and I could be wrong, but it does seem that):

Regression becomes really relevant only for the over-determined systems.

You could use regression even for the large consistent systems, but there, it would become indistinguishable from the iterative approaches to the solution. You could also use regression for the under-determined systems (and you can anyway use least squares for them). But I am not sure you would want to use regression, specifically, here. In any case…

Data Science is full of problems where systems of equations are over-determined.

Every time you run into an Excel spreadsheet that has more rows than columns, you have an over-determined system.

That’s because such systems are, in general, not consistent: no single choice of unknowns satisfies every row exactly. That’s why regression comes in handy; it becomes important. If all systems were consistent and exactly determined, people would be happy using deterministic solutions, not regression.
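To see what “over-determined” looks like in code, here is a small sketch (with made-up numbers of my own) in which there are more rows than unknown coefficients; np.linalg.lstsq then gives you the least-squares “best compromise”, since no exact solution exists:

import numpy as np

# Over-determined: 5 equations (rows), but only 2 unknowns (a straight-line fit)
x = np.array( [ 0.0, 1.0, 2.0, 3.0, 4.0 ] )
y = np.array( [ 1.1, 2.9, 5.2, 6.8, 9.1 ] )   # made-up, slightly noisy values

# The design matrix: a column of ones for the intercept, and a column of x
X = np.column_stack( [ np.ones_like( x ), x ] )

# No exact solution exists; lstsq minimizes the sum of the squared errors
beta, residuals, rank, sv = np.linalg.lstsq( X, y, rcond=None )
print( "Intercept: %lf, Slope: %lf" % ( beta[ 0 ], beta[ 1 ] ) )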


4. How do you use a regression model?

4.1 The \hat{y} function as the placeholder:

In regression, we first propose (i.e. assume) a mathematical model, which can be broadly stated as:

y = \hat{y} + \epsilon,

where the y are the row-wise given data values, and the \hat{y} are their respective estimated values. Thus, \hat{y} is a function of the given dataset values x_j‘s; it stands for the value of y as estimated by the regression model. The error term gives you the row-wise differences between the actual and the predicted values.

4.2 An assumed \hat{y} function:

But while y_i‘s and x_{ij}‘s are the given values, the function \hat{y} is what we first have to assume.

Accordingly, we may say, perhaps, that:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2                   ……….[Eq. 1]

or, alternatively, that:

\hat{y} = \beta_0 + \beta_1^2 x_1                   ……….[Eq. 2]

These two equations would give you two different estimates \hat{y}s for the “true” (and unknown) y‘s, given the same dataset (the same spreadsheet).

Now, notice a few things about the form of the mathematical equation being assumed—for a regression model.

4.3 \beta_m‘s refer to the entire data-table taken as a whole—not row-wise:

In any given regression model, \beta_m coefficients do not change their values from one row to another row. They remain the same for all the rows of a given, concrete, dataset. It’s only the values of x_1, x_2 and y that change from row to row. The values of \beta_0, \beta_1, \dots remain the same for all the rows of the dataset taken together.

4.4 Applying a regressed model to new data works with one row at a time:

In using a regression model, we assume a model equation (Eq. 1 or Eq. 2), then plug a new tuple of known independent-variable values (x_1, x_2, \dots) into it, and voila! Out comes the \hat{y}, as predicted by that model. We treat \hat{y} as the best estimate of the true but unknown y for that specific combination of (x_1, x_2, \dots) values.

Here, realize, for making predictions, just one row of the test dataset is enough—provided the values of \beta_m‘s are already known.

In contrast, to build a regression model, we need all N rows of the training dataset, where N can be a very large number.

By the time you come to use a regression model, you have, somehow, already determined the initially “undetermined” coefficients \beta_m‘s. They have already become known to you. That’s why all that you need to do now is to plug in some new values of the x_j‘s, and get the prediction \hat{y} out.

This fact provides us with the first important clue:

The use of regression treats \beta_m coefficients as the known values that remain the same for all data-rows. That is, the use of regression treats \beta_m values as constants!
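For instance, here is a minimal sketch (with made-up \beta_m values of my own, purely for illustration) of using an already-built model of the form of Eq. 1 on a single new data-row:

import numpy as np

# Already-determined coefficients for Eq. 1 (made-up values, for illustration only)
beta = np.array( [ 0.5, 1.2, -0.7, 0.3, 0.05, -0.02 ] )

def Predict( x1, x2 ):
    # Plug one new (x1, x2) tuple into the assumed model of Eq. 1
    features = np.array( [ 1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2 ] )
    return np.dot( beta, features )

# One row is enough, because the betas are already known, and held constant
print( "Predicted y-hat: %lf" % Predict( 2.0, 3.0 ) )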

But how do you get to know what values to use for \beta_ms? How do you get to a regressed model in the first place?


5. How do you conduct regression?—i.e., how do you build the regression model?

In regression analysis, you first assume a model (say, Eq. 1 or Eq. 2 above). You then try to determine the values of the \beta_m‘s using some method like least squares, typically implemented via gradient descent.

I will not go into why you use the method of least squares here. In brief: for models linear in the coefficients, the cost-function surface of the sum of the squared errors turns out to be convex (which is not true of the squared-error cost for the logistic regression model), and the least squares method gives you the demonstrably “best” estimate for the coefficients. Check out at least this much, here [^]. If interested, also check out the Gauss-Markov theorem it mentions, on the ‘net.

I also ask you to go, teach yourself, the gradient descent method. The best way to do that is by “stealing” a simple-to-understand Python code. Now, chuck aside the Jupyter notebook, and instead use an IDE, and debug-step through the code. Add at least two print statements for every Python statement. (That’s an order!)

Once done with it, pick up the mathematical literature. You will now find that you can go through the maths as easily as a hot knife moves through a piece of already warm butter. [I will try to post a simple gradient descent code here in the next post, but no promises.]

So, that’s how you get to determine the (approximate but workable) values of \beta_m coefficients. You use the iterative technique of gradient descent, treating regression as a kind of an optimization problem. [Don’t worry if you don’t know the latter.]

“This is getting confusing. Tell me the real story” you say?

Ok. Vaguely speaking, the real story behind the gradient descent and the least squares goes something like in the next section.


6. How the GD works—a very high-level view:

To recap: Using a regression model treats the \beta_m coefficients as constants. Reaching (or building) a regression model from a dataset treats the \beta_m coefficients as variables. (This is regardless of whether you use gradient descent or any other method.)

In the gradient descent algorithm (GD), you choose a model equation like Eq. 1 or Eq. 2.

Then, you begin with some arbitrarily (but reasonably!) chosen \beta_m values as the initial guess.

Then, for each row of the dataset, you substitute the guessed \beta_m in the chosen model equation. Since y, x_1, and x_2 are known for each row of the dataset, you can find the error for each row—if these merely guessed \beta_m values were to be treated as being an actual solution for all of them.

Since the merely guessed values of \beta_m aren’t in general the same as the actual solution, their substitution on the right hand-side of the model (Eq. 1 or Eq. 2) doesn’t reproduce the left hand-side. Therefore, there are errors:

\epsilon = y - \hat{y}

Note, there is a separate error value for each row of the dataset.

You then want to know some single number which tells you how bad your initial guess of the \beta_m values was, when this guess was applied to every row in the dataset. (Remember, the \beta_m always remain the same for the entire dataset; they don’t change from row to row.) In other words, you want a measure of the “total error” for the dataset.

Since the row-wise errors can be both positive and negative, if you simply sum them up, they might partially or fully cancel each other out. We therefore need some positive-valued measure for each row-wise error before we add all of them together. One simple (and, for many mathematical reasons, very good) choice for turning both positive and negative row-wise errors into an always-positive measure is to take the squares of the individual row-wise errors, and then add them all together, so as to get the sum of the squared errors, i.e., the total error for the entire dataset.

The dataset error measure typically adopted then is the average of the squares of the row-wise errors.
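In symbols, this is just the mean of the squared errors described above:

C(\beta_0, \beta_1, \dots, \beta_m) = \dfrac{1}{N} \sum_{i=1}^{N} \epsilon_i^2 = \dfrac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2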

In this way, you come to assume a cost function for the total dataset. A cost function is a function of the \beta_m values. If there are two \beta‘s in your model, say \beta_0 and \beta_1, then the cost function C would vary as either or both of the coefficient values are varied. In general, it forms a cost-function surface constructed over the parameters-space.

You then calculate an appropriate correction to be applied to the initially guessed values of the \beta_m‘s, so as to improve your initial guess. This correction is taken as being directly proportional to the (negative of the) gradient of the current cost function with respect to the \beta_m‘s; that is where the name “gradient descent” comes from.

In short, you take a guess for the \beta_m‘s, and find the current value of the total cost function C for the dataset (obtained by substituting the guessed \beta_m values into the assumed model equation and accumulating the squared row-wise errors). Then, you apply the correction to the \beta_m values, and thereby update them.

You then repeat this procedure, improving your guess during each iteration.
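Here is a minimal sketch of what such a loop might look like for the simplest possible model, \hat{y} = \beta_0 + \beta_1 x. (This is only my illustrative toy code with made-up data, not the fuller gradient descent code promised above.)

import numpy as np

# Made-up data: the "true" betas are 3.0 and 2.0, plus a little noise
np.random.seed( 0 )
x = np.random.rand( 100 )
y = 3.0 + 2.0 * x + 0.1 * np.random.randn( 100 )

dBeta0, dBeta1 = 0.0, 0.0     # the initial guess
dEta = 0.1                    # the learning rate
for i in range( 5000 ):
    adYHat = dBeta0 + dBeta1 * x            # estimates for the current guess
    adEps = y - adYHat                      # the row-wise errors
    dCost = np.mean( adEps ** 2 )           # the total cost for the dataset
    # Gradients of the cost w.r.t. the betas (the betas are the variables here!)
    dGrad0 = -2.0 * np.mean( adEps )
    dGrad1 = -2.0 * np.mean( adEps * x )
    # The correction: proportional to the gradient, scaled by the learning rate
    dBeta0 -= dEta * dGrad0
    dBeta1 -= dEta * dGrad1

print( "beta_0: %lf, beta_1: %lf, final cost: %lf" % ( dBeta0, dBeta1, dCost ) )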

That’s what everyone says about the gradient descent. Now, something which they don’t point out, but which is important for our purposes:

Note that in any such processing (using GD or any other iterative technique), you essentially are treating the \beta_m terms as variables, not as constants. In other words:

The process of regression—i.e., the process of model-building (as in contrast to using an already built model)—treats the unknown coefficients \beta_m‘s as variables, not constants.

The \beta_m values then iteratively converge towards some stable set of values which, within a tolerance, you can accept as being the final solution.

The values of \beta_m‘s so obtained are, thus, regarded as being essentially constant (within some specified tolerance band).


7. How statisticians look at the word “regression”:

Statisticians take the word “regression” primarily in the sense of the process of building a regression model.

Taken in this sense, the word “regression” does not refer to the process of using the final model. It is not a process of taking some value on the x-axis, locating the point on the model curve, and reading out the \hat{y} from it.

To statisticians,

Regression is, primarily, a process of regressing from a set of specific (y, x_1, x_2, \dots) tuples in a given dataset, to the corresponding estimated mean values \hat{y}‘s.

Thus, statisticians regard regression as the shifting, turning, and twisting of a tentatively picked-up curve of \hat{y} vs. x_1, so that it comes to fit the totality of the dataset in some best possible sense. [With more than one independent variable, what gets fitted is a surface, not a curve.]

To repeat: The process of regressing from the randomly distributed and concretely given y values to the regressed mean values \hat{y}, while following an assumed functional form for the model, necessarily treats the \beta_m‘s as variables.

Thus, \hat{y}, and therefore C (the cost function used in GD) is a function not just of x_{ij}‘s but also of \beta_m‘s.

So, the regression process is not just:

\hat{y} = f( x_1, x_2, \dots).

It actually is:

\hat{y} = f( x_1, x_2, \dots; \beta_0, \beta_1, \beta_2, \dots).

Now, since (x_1, x_2, x_3,\dots) remain constants, it is \beta_m‘s which truly become the king—it is they which truly determine the row-wise \hat{y}‘s, and hence, the total cost function C = C( x_1, x_2, \dots; \beta_0, \beta_1, \beta_2, \dots).

So, on to the most crucial observation:

Since \hat{y} is a function of the \beta_m‘s, and since the \beta_m‘s are regarded as changeable, the algorithm’s behaviour depends on what kind of functional dependency \hat{y} has, in the assumed model, on the \beta_m coefficients.

In regression with Eq. 1, \hat{y} is a linear function of \beta_m‘s. Hence, the evolution of the \beta_m values during a run of the algorithm would be a linear evolution. Hence this regression—this process of updating \beta_m values—is linear in nature.

However, in Eq. 2, \hat{y} depends quadratically on \beta_1; the model is nonlinear in the coefficients. Hence, it is a non-linear regression.

Overall,

The end-result of a polynomial regression is a mathematical function which, in general, is nonlinear in {x_j}‘s.

However, the process of regression (the evolution itself) is linear in the \beta_m‘s for Eq. 1.

Further, no regression algorithm or process ever changes any of the given (x_1, x_2, \dots) or y values.

Statisticians therefore say that the polynomial regression is a subclass of linear regression—even if it is a polynomial in x_j‘s.
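One way to make the “linear in \beta_m” point concrete: stack the row-wise features of Eq. 1 into a design matrix X, and the whole model becomes \hat{y} = X \beta, a system that is plainly linear in the \beta_m‘s even though it is nonlinear in the x_j‘s. Ordinary linear least squares then recovers the coefficients. (A sketch, with made-up data of my own:)

import numpy as np

# Made-up dataset: N rows of (x1, x2), and y generated from known coefficients
np.random.seed( 1 )
N = 200
x1 = np.random.rand( N )
x2 = np.random.rand( N )
y = 1.0 + 2.0*x1 - 3.0*x2 + 0.5*x1*x2 + 4.0*x1**2 - 1.5*x2**2 \
    + 0.05*np.random.randn( N )

# The design matrix for Eq. 1: nonlinear in the x_j's, but y-hat = X @ beta
# is linear in the beta_m's -- which is all that "linear regression" means here
X = np.column_stack( [ np.ones( N ), x1, x2, x1*x2, x1**2, x2**2 ] )

beta, _, _, _ = np.linalg.lstsq( X, y, rcond=None )
print( "Recovered coefficients:", np.round( beta, 2 ) )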


8. Homework:

Homework 1:

Argue back with the statisticians.

Tell them that in using Eq. 1 above, the regression process does evolve linearly, but each evolution occurs with respect to another x-axis, say an x'-axis, where x' has been scaled quadratically with respect to the original x. It’s only after this quadratic scaling (or mapping) that we can at all get a straight line in the mapped space, not in the original space.

Hence, argue that Eq. 1 must be regarded as something like “semi-linear” or “linear-quadratic” regression.

Just argue, and see what they say. Do it. Even if only for fun.

I anticipate that they will not give you a direct answer. They will keep it vague. The reason is this:

Statisticians are, basically, mathematicians. They would always love to base their most basic definitions not in the epistemologically lower-level facts or abstractions, but in abstractions as high-level as it is possible for them to reach. (Basically, in short, they all are Platonists at heart, whether explicitly or implicitly.)

That’s why, when you argue with them in the preceding manner, they will simply come to spread a large vague smile on their own face, and point out to you the irrefutable fact that the cost-function surface makes a direct reference only to the parameters-space (i.e. a space spanned by the variations in \beta_m‘s). The smile would be vague, because they do see your point, but even if they do, they also know that answering this way, they would succeed in having silenced you. Such an outcome, in their rule-book of the intellectuals, stands for victory—theirs. It makes them feel superior. That’s what they really are longing for.

It’s my statistical prediction that most statisticians would answer thusly with you. (Also most any Indian “intellectual”. [Indians are highly casteist a people—statistically speaking. No, don’t go by my word for it; go ask any psephologist worth his salt. So, the word should be: caste-intellectuals. But I look for the “outliers” here, and so, I don’t use that term—scare-quotes are enough.])

In the case this most probable event occurs, just leave them alone, come back, and enjoy some songs. …

Later on, who knows, you might come to realize that even if Platonism diminishes their discipline and personae, what they have developed as a profession and given you despite their Platonism, had enough of good elements which you could use practically. (Not because Platonism gives good things, but because this field is science, ultimately.)

…Why, in a more charitable mood, you might even want to thank them for having done that—for having given you some useful tools, even if they never clarified all their crucial concepts well enough to you. They could not have—given their Platonism and/or mysticism. So, when you are in a sufficiently good mood, just thank them, and leave them alone. … “c’est la vie…”

Homework 2:

Write a brief (one page, 200–300 words) summary for this post. Include all the essentials. Then, if possible, also make a flash card or two out of it. For neat examples, check out Chris Albon’s cards [^].


A song I like:

(Hindi) “gaataa rahe meraa dil…”
Music: S. D. Burman (with R.D. and team’s assistance)
Singers: Kishore Kumar, Lata Mangeshkar
Lyrics: Shailendra


History:
— First Published: 2019.11.29 10:52 IST.

PS: Typos to be corrected later today.

 

Data science code snippets—2: Using world’s simplest ANN to visualize that deep learning is hard

Good morning! [LOL!]


In the context of machine learning, “learning” only means: changes that get effected to the weights and biases of a network. Nothing more, nothing less.

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. Actually, the development of the code went through two distinct stages.


First stage: A two-layer network (without any hidden layer); only one neuron per layer:

In the first stage, I wrote what demonstrably is the world’s simplest possible artificial neural network. This network comes with only two layers: the input layer and the output layer. Thus, there is no hidden layer at all.

The advantage of such a network (“linear” or “1D”, and without a hidden layer) is that the total error at the output layer is a function of just one weight and one bias. Thus the total error (or the cost function) is a function of only two independent variables. Therefore, the cost-function surface can be directly visualized or plotted at all. For ease of coding, I plotted this function as a contour plot, but a 3D surface plot should also be possible.
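Spelled out in symbols (my notation, matching the code given further below): for a single input x with target t, this network computes \hat{y} = \sigma(w x + b), so that the cost

C(b, w) = \dfrac{1}{2} \left( t - \sigma(w x + b) \right)^2

depends on just the two numbers b and w, which is why its surface can be drawn directly.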

Here are a couple of pictures it produces. The input randomly varies in the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept at 0.15.

The following plot shows an intermediate stage during training, when the gradient descent algorithm is still very much in progress.

Simplest network with just input and output layers, with only one neuron per layer: gradient descent in progress.

The following plot shows when the gradient descent has taken the configuration very close to the local (and global) minimum:

Simplest network with just input and output layers, with only one neuron per layer: gradient descent near the local (and global) minimum.


Second stage: An n-layer network (with many hidden layers); only one neuron per layer

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.
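For reference, this is what the BackPropagation() function in the code below computes, written out for the one-neuron-per-layer case: the error signal travels backward as

\delta_l = w_{l+1} \, \delta_{l+1} \, \sigma'(z_l),

with the gradients \partial C / \partial b_l = \delta_l and \partial C / \partial w_l = \delta_l \, a_{l-1}. Since \sigma'(z) \leq 0.25 everywhere, each additional layer multiplies the signal by yet another small factor, which is one way of seeing why the layers far away from the output end up learning so little.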

Here is a representative picture it produces when the total number of layers is 5 (i.e., when there are 3 hidden layers). It was made with all the biases and all the weights initially set to the value of 2.0:

It is clearly seen that almost all of the learning is limited to the output layer; the hidden layers have learnt almost nothing!

Now, by way of contrast, here is what happens when you have all initial biases of 0.0, and all initial weights of 2.0:

Here, the last hidden layer has begun learning enough that it shows some visible change during training, even though the output layer learns much more than does the last hidden layer. Almost all learning is via changes to weights, not biases.

Next, here is the reverse situation: when you have all initial biases of 2.0, but all initial weights of 0.0:

The bias of the hidden layer does undergo a slight change, but in an opposite (positive) direction. Compared to the case just above, the learning now is relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both biases and weights to 0.0.

The network does learn (the difference between the predicted and the target values does go on diminishing as the training progresses). However, the percentage change is too small to be visually registered (when plotted to the same scale as was used earlier).


The code:

Here is the code which produced all the above plots (but you have to suitably change the hard-coded parameters to get to each of the above cases):

'''
SimplestANNInTheWorld.py
Written by and Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only one neuron. Naturally, there is only one target!
-- It may not even have any hidden layers. 
-- However, it can have an arbitrary number of hidden layers. This 
feature makes it a good test-bed to see why and how the neurons 
in the hidden layers don't learn much during deep learning, during 
a ``straight-forward'' application of the gradient descent algorithm. 
-- Please do drop me a comment or an email 
if you find this code useful in any way, 
say, in a corporate training setup or 
in academia. Thanks in advance!
-- History:
* 30 December 2018 09:27:57  IST: 
Project begun
* 30 December 2018 11:57:44  IST: 
First version that works.
* 01 January 2019 12:11:11  IST: 
Added visualizations for activations and gradient descent for the 
last layer, for no. of layers = 2 (i.e., no hidden layers). 
* 01 January 2019 18:54:36  IST:
Added visualizations for percent changes in biases and 
weights, for no. of layers >=3 (i.e. at least one hidden layer).
* 02 January 2019 08:40:17  IST: 
The version as initially posted on my blog.
'''
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution, 
    # but rand() returns samples from the uniform distribution: [0,1). 
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):
    adAllBs = np.random.randn( nLayers-1 )
    adAllWs = np.random.randn( nLayers-1 )
    return adAllBs, adAllWs

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):
    adAllBs = np.ndarray( nLayers-1 )
    adAllBs.fill( dB )
    adAllWs = np.ndarray( nLayers-1 )
    adAllWs.fill( dW )
    return adAllBs, adAllWs

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with the activation at the output layer. 
# In this script, the target value is always one and the 
# same (viz., the global dTarget). 
# Assumes that the form of the cost function is: 
#       C_x = 0.5 * ( dT - dA )^2 
# where, note carefully, that target comes first. 
# Hence the partial derivative is:
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# where note carefully that the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget ) 

def Transpose( dA ):
    # For this one-neuron-per-layer network everything is a scalar, so the
    # transpose is effectively a no-op; return the transposed value anyway.
    return np.transpose( dA )

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member
    # as compared to l_dAllZs, with the first member being the
    # supplied activation of the input layer 
    l_dAllAs = [ dA ]
    nL = 1
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers
        # to the activation of the current layer (nL) 
        dA = Sigmoid( dZ )  
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash
    dGradB = dDelta
    adAllGradBs[ -1 ] = dGradB

    # Since the last hidden layer has only one neuron, no need to take transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )
    adAllGradWs[ -1 ] = dGradW
    ## print( "\t* Layer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-1, dGradB, dGradW) )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNext = adAllWs[ -nL+1 ]
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta ) 
        dDelta = dDot * dZCurDash
        dGradB = dDelta
        adAllGradBs[ -nL ] = dGradB

        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )
        adAllGradWs[ -nL ] = dGradWCur

        ## print( "\tLayer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-nL, dGradB, dGradW) )

    return ( adAllGradBs, adAllGradWs )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget) 
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )
    
    nLayers = len( l_dAllAs )
    dES = 0.2	# Extra space, in inches
    plt.axis( [-dES, float(nLayers) -1.0 + dES, -dES, 1.0+dES] )
    
    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )
    
    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )
    
def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()
    
    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] ) 
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
    "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB) 
    plt.title( sDesc )
    
    
def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b ) 
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys, # cmap=plt.cm.RdYlBu_r, 
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

def PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget ):
    plt.clf()
    
    nComputeLayers = len( adOrigBs )
    plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )
    
    adBPct = GetPercentDiff( nComputeLayers, adAllBs, adOrigBs )
    adWPct = GetPercentDiff( nComputeLayers, adAllWs, adOrigWs )
    print( "Case: %3d" \
    "\nPercent Changes in Biases:\n%s" \
    "\nPercent Changes in Weights:\n%s\n" \
     % (c, adBPct, adWPct)  )
    adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
    plt.plot( adx + 1.0, adWPct, 'ro' )
    plt.plot( adx + 1.15, adBPct, 'bo' )
    plt.grid()
    plt.xlabel( "Layer Number" )
    plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
    sTitle = "How most learning occurs only at an extreme layer\n" \
    "Percent Changes to Biases and Weights at Each Layer.\n" \
    "Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted) 
    plt.title( sTitle )

def GetPercentDiff( n, adNow, adOrig ):
    adDiff = adNow - adOrig
    print( adDiff )
    adPct = np.zeros( n )
    dSmall = 1.0e-10
    if np.all( np.abs( adDiff ) > dSmall ) and np.all( np.abs( adOrig ) > dSmall ):
        adPct = adDiff / adOrig * 100.0
    return adPct


################################################################################
# The Main Script
################################################################################

dEta = 1.0 # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases ) #, 0.0 )
adTest = GenerateDataRandom( nTestCases )

np.random.shuffle( adInput )
## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0), 
# i.e., excluding both the end-points of 0.0 and 1.0.

dTarget = 0.15

# The input layer has no biases or weights. Even the output layer 
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a 
# vector.
# For visualization with a 2 layer-network, keep biases and weights 
# between [-4.0, 4.0]

# adAllBs, adAllWs = GenerateBiasesWeightsRandom( nLayers )
adAllBs, adAllWs = GenerateBiasesWeightsConstant( nLayers, 2.0, 2.0 )
dOrigB = adAllBs[-1]
dOrigW = adAllWs[-1]
adOrigBs = adAllBs.copy()
adOrigWs = adAllWs.copy()

## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )
    
# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )
    
    adAllGradBs = [ np.zeros( b.shape ) for b in adAllBs ]
    adAllGradWs = [ np.zeros( w.shape ) for w in adAllWs ]
    
    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    
    # Do the back-propagation
    adAllGradBs, adAllGradWs = BackPropagation( l_dAllZs, l_dAllAs )

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB 
                for dB, dDeltaB in zip( adAllBs, adAllGradBs ) ]
    adAllWs = [ dW - dEta * dDeltaW 
                for dW, dDeltaW in zip( adAllWs, adAllGradWs ) ]
    
    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        dW = adAllWs[ -1 ]
        dB = adAllBs[ -1 ]
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights, Biases for all layers
        # and Activations for all layers 
        dPredicted = l_dAllAs[ -1 ]
        PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget )
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):
    
    dInput = adTest[ c ]
    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )
    
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n" % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )

 


Things you can try:

  • Change one or more of the following parameters, and see what happens:
    • Target value
    • Values of initial weights and biases
    • Number of layers
    • The learning rate, dEta
  • Change the activation function (e.g., try a linear function instead of the Sigmoid), or the cost function, and change the code accordingly; a sketch appears just after this list.
  • Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…
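For the activation-function item above, here is a minimal sketch (my own variant, not tested against every combination of the parameters) of what the drop-in replacements might look like:

# A linear (identity) activation, as a drop-in replacement for Sigmoid()
def Linear( dZ ):
    return dZ

def LinearDerivative( dZ ):
    return 1.0

# Then use Linear( dZ ) in place of Sigmoid( dZ ) inside FeedForward(), and
# LinearDerivative() in place of SigmoidDerivative() inside BackPropagation().
# (You may also have to reduce dEta, since the activations are no longer
# squashed into (0, 1).)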

Have fun!


A song I like:

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”
Music and Singer: C. Ramchandra
Lyrics: Suresh Bhat