Data Science links—1

Oakay… My bookmarks library has grown too big. Time to move at least a few of them to a blog-post. Here they are. … The last one is not on Data Science, but it happens to be the most important one of them all!



On Bayes’ theorem:

Oscar Bonilla. “Visualizing Bayes’ theorem” [^].

Jayesh Thukarul. “Bayes’ Theorem explained” [^].

Victor Powell. “Conditional probability” [^].


Explanations with visualizations:

Victor Powell. “Explained Visually.” [^]

Christopher Olah. Many topics [^]. For instance, see “Calculus on computational graphs: backpropagation” [^].


Fooling the neural network:

Julia Evans. “How to trick a neural network into thinking a panda is a vulture” [^].

Andrej Karpathy. “Breaking linear classifiers on ImageNet” [^].

A. Nguyen, J. Yosinski, and J. Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images” [^]

Melanie Mitchell. “Artificial Intelligence hits the barrier of meaning” [^]


The Most Important link!

Ijad Madisch. “Why I hire scientists, and why you should, too” [^]


A song I like:

(Western, pop) “Billie Jean”
Artist: Michael Jackson

[Back in the ’80s, this song used to get played in the restaurants from the Pune camp area, and also in the cinema halls like West-End, Rahul, Alka, etc. The camp area was so beautiful, back then—also uncrowded, and quiet.

This song would also come floating on the air, while sitting in the evening at the Quark cafe, situated in the middle of all the IITM hostels (next to skating rink). Some or the other guy would be playing it in a nearby hostel room on one of those stereo systems which would come with those 1 or 2 feet tall “hi-fi” speaker-boxes. Each box typically had three stacked speakers. A combination of a separately sitting sub-woofer with a few small other boxes or a soundbar, so ubiquitous today, had not been invented yet… Back then, Quark was a completely open-air cafe—a small patch of ground surrounded by small trees, and a tiny hexagonal hut, built in RCC, for serving snacks. There were no benches, even, at Quark. People would sit on those small concrete blocks (brought from the civil department where they would come for testing). Deer would be roaming very nearby around. A daring one or two could venture to come forward and eat pizza out of your (fully) extended hand!…

…Anyway, coming back to the song itself, I had completely forgotten it, but got reminded when @curiouswavefn mentioned it in one of his tweets recently. … When I read the tweet, I couldn’t make out that it was this song (apart from Bach’s variations) that he was referring to. I just idly checked out both of them, and then, while listening to it, I suddenly recognized this song. … You see, unlike so many other guys of e-schools of our times, I wouldn’t listen to a lot of Western pop-songs those days (and still don’t). Beatles, ABBA and a few other groups/singers, may be, also the Western instrumentals (a lot) and the Western classical music (some, but definitely). But somehow, I was never too much into the Western pop songs. … Another thing. The way these Western singers sing, it used to be very, very hard for me to figure out the lyrics back then—and the situation continues mostly the same way even today! So, recognizing a song by its name was simply out of the question….

… Anyway, do check out the links (even if some of them appear to be out of your reach on the first reading), and enjoy the song. … Take care, and bye for now…]

 

Advertisements

Update: Pursuing some simple (and possibly new) ideas in Data Science

Last Saturday, I attended a Data Science-related meetup in Pune (the one organized by DataGiri). I enjoyed all the four sessions covered in it (one each on logistic regression, SVM, clustering, and ensemble methods). … Out of the past 4/5 events or 1-day introductory workshops on ML/DL which I have attended so far in Pune, I think this one was by far the best.

Attending events like these (also conferences) often has an effect: due to the informality of the interaction, you begin to look at the same things from a slightly different perspective. That precisely is what seems to have happened to me this time round.

Cutting straight to the point, I think that after attending this event, I might have stumbled across a couple of small little ideas concerning the techniques that were discussed. These ideas could have an element of novelty. At least that’s what I feel. … Several Internet searches (and consulting standard books up to Bishop and ESLII) hasn’t thrown up something similar so far. So, who knows… And yes, it’s not just the novelty; there also should be some advantages to be had in practical applications too.

Of course, Data Science is relatively a new field for me, and so, my knowledge of these topics is pretty limited. Still, currently, I am engaged in taking these ideas a little further. From what I have come across thus far, it does look like there should be something to these ideas. But I need to both flesh out the ideas and take the literature-search further… much, much further.

At the same time, I am also having a look at the angle of whether a patent or two can come out of these ideas or not. So far, the prospects do seem promising. So, if you have the means to sponsor patents, and if NDAs are OK by you, then feel free to get in touch with me for some more details and the current status of development.

Bottomline: Nothing major here; just a couple of small ideas (or small variations on the known techniques). But they do seem neat and novel. In any case, they certainly are worth pursuing a bit further.

…Take care and bye for now…


A song I like:

(Hindi) “mere jaise ban jaaoge…”
Singers: Jagjit and Chitra Singh
Lyrics: Saeed Rahi (?)
Music: Jagjit Singh

 

Non-Interview Questions on Data Science—Part 1

This entry is the first in a series of posts which will note some of the questions that no one will ever ask you during any interview for any position in the Data Science industry.

Naturally, if you ask for my opinion, you should not consider modifying these questions a bit and posting them as a part of your own post on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.

No, really! There would be no point in lifting these questions and posting them as if they were yours, because no one in the industry is ever going to get impressed by you because you raised them. … I am posting them here simply because… because “I am like that only.”

OK, so here is the first installment in this practically useless series. (I should know. I go jobless.)

(Part 1 mostly covers linear and logistic regression, and just a bit of probability.)


Q.1: Consider the probability theory. How are the following ideas related to each other?: random phenomenon, random experiment, trial, result, outcome, outcome space, sample space, event, random variable, and probability distribution. In particular, state precisely the difference between a result and an outcome, and between an outcome and an event.

Give a few examples of finite and countably infinite sample spaces. Give one example of a random variable whose domain is not the real number line. (Hint: See the Advise at the end of this post concerning which books to consult.)


Q.2: In the set theory, when a set is defined through enumeration, repeated instances are not included in the definition. In the light of this fact, answer the following question: Is an event a set? or is it just a primitive instance subsumed in a set? What precisely is the difference between a trial, a result of a trial, and an event? (Hint: See the Advise at the end of this post concerning which books to consult.)


Q.3: Select the best alternative: In regression for making predictions with a continuous target data, if a model is constructed in reference to the equation y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3, then:
(a) It is a sub-type of the linear regression model.
(b) It is a polynomial regression model.
(c) It is a nonlinear regression model because powers > 1 of the independent variable x_i are involved.
(d) It is a nonlinear regression model because more than two \beta_m terms are involved.
(e) Both (a) and (b)
(g) Both (b) and (c)
(f) Both (c) and (d)
(g) All of (b), (c), and (d)
(h) None of the above.
(Hint: Don’t rely too much on the textbooks being used by the BE (CS) students in the leading engineering colleges in Pune and Mumbai.)


Q.4: Consider a data-set consisting of performance of students on a class test. It has three columns: student ID, hours studied, and marks obtained. Suppose you decide to use the simple linear regression technique to make predictions.

Let’s say that you assume that the hours studied are the independent variable (predictor), and the marks obtained are the dependent variable (response). Making this assumption, you make a scatter plot, carry out the regression, and plot the regression line predicted by the model too.

The question now is: If you interchange the designations of the dependent and independent variables (i.e., if you take the marks obtained as predictors and the hours studied as responses), build a second linear model on this basis, and plot the regression line thus predicted, will it coincide with the line plotted earlier or not. Why or why not?

Repeat the question for the polynomial regression. Repeat the question if you include the simplest interaction term in the linear model.


Q.5: Draw a schematic diagram showing circles for nodes and straight-lines for connections (as in the ANN diagrams) for a binary logistic regression machine that operates on just one feature. Wonder why your text-book didn’t draw it in the chapter on the logistic regression.


Q.6: Suppose that the training input for a classification task consists of r number of distinct data-points and c number of features. If logistic regression is to be used for classification of this data, state the number of the unknown parameters there would be. Make suitable assumptions as necessary, and state them.


Q.7: Obtain (or write) some simple Python code for implementing from the scratch a single-feature binary logistic regression machine that uses the simple (non-stochastic) gradient descent method that computes the gradient for each row (batch-size of 1).

Modify the code to show a real-time animation of how the model goes on changing as the gradient descent algorithm progresses. The animation should depict a scatter plot of the sample data (y vs. x) and not the parameters space (\beta_0 vs. \beta_1). The animation should highlight the data-point currently being processed in a separate color. It should also show a plot of the logistic function on the same graph.

Can you imagine, right before running (or even building) the animation, what kind of visual changes is the animation going to depict? how?


Q.8: What are the important advantage of the stochastic gradient descent method over the simple (non-stochastic) gradient descent?


Q.9: State true or false: (i) The output of the logistic function is continuous. (ii) The minimization of the cost function in logistic regression involves a continuous dependence on the undetermined parameters.

In the light of your answers, explain the reason why the logistic regression can at all be used as a classification mechanism (i.e. for targets that are “discrete”, not continuous). State only those axioms of the probability theory which are directly relevant here.


Q.10: Draw diagrams in the parameters-space for the Lasso regression and the Ridge regression. The question now is to explain precisely what lies inside the square or circular region. In each case, draw an example path that might get traced during the gradient descent, and clearly explain why the progress occurs the way it does.


Q.11: Briefly explain how the idea of the logistic regression gets applied in the artificial neural networks (ANNs). Suppose that a training data-set has c number of features, r number of data-rows, and M number of output bins (i.e. classification types). Assuming that the neural network does not carry any hidden layers, calculate the number of logistic regressions that would be performed in a single batch. Make suitable assumptions as necessary.

Does your answer change if you consider the multinomial logistic regression?


Q.12: State the most prominent limitation of the gradient descent methods. State the name of any one technique which can overcome this limitation.


Advise: To answer the first two questions, don’t refer to the programming books. In fact, don’t even rely too much on the usual textbooks. Even Wasserman skips over the topic and Stirzaker is inadquate. Kreyszig is barely OK. A recommended text (more rigorous but UG-level, and brief) for this topic is: “An Introduction to Probability and Statistics” (2015) Rohatgi and Saleh, Wiley.


Awww… Still with me?

If you read this far, chances are very bright that you are really^{really} desperately looking for a job in the data science field. And, as it so happens, I am also a very, very kind hearted person. I don’t like to disappoint nice, ambitious… err… “aspiring” people. So, let me offer you some real help before you decide to close this page (and this blog) forever.

Here is one question they might actually ask you during an interview—especially if the interviewer is an MBA:

A question they might actually ask you in an interview: What are the three V’s of big data? four? five?

(Yes, MBA’s do know arithmetic. At least, it was there on their CAT / GMAT entrance exams. Yes, you can use this question for your posts on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.)


A couple of notes:

  1. I might come back and revise the questions to make them less ambiguous or more precise.
  2. Also, please do drop a line if any of the questions is not valid, or shows a poor understanding on my part—this is easily possible.

 


A song I like:

[Credits listed in a random order. Good!]

(Hindi) “mausam kee sargam ko sun…”
Music: Jatin-Lalit
Singer: Kavita Krishnamoorthy
Lyrics: Majrooh Sultanpuri


History:

First written: Friday 14 June 2019 11:50:25 AM IST.
Published online: 2019.06.16 12:45 IST.
The songs section added: 2019.06.16 22:18 IST.

The Machine Learning as an Expert System

To cut a somewhat long story short, I think that I can “see” that Machine Learning (including Deep Learning) can actually be regarded as a rules-based expert system, albeit of a special kind.

I am sure that people must have written articles expressing this view. However, simple googling didn’t get me to any useful material.

I would deeply appreciate it if someone could please point out references in this direction. Thanks in advance.


BTW, here is a very neat infographic on AI: [^]; h/t [^]. … Once you finish reading it, re-read this post, please! Exactly once again, and only the first part—i.e., without recursion!. …


A song I like:

(Marathi) “visar preet, visar geet, visar bheT aapuli”
Music: Yashwant Dev
Lyrics: Shantaram Nandgaonkar
Singer: Sudhir Phadke

 

Data science code snippets—2: Using world’s simplest ANN to visualize that deep learning is hard

Good morning! [LOL!]


In the context of machine learning, “learning” only means: changes that get effected to the weights and biases of a network. Nothing more, nothing less.

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. Actually, the development of the code went through two distinct stages.


First stage: A two-layer network (without any hidden layer); only one neuron per layer:

In the first stage, I wrote what demonstrably is world’s simplest possible artificial neural network. This network comes with only two layers: the input layer, and the output layer. Thus, there is no hidden layer at all.

The advantage with such a network (“linear” or “1D” and without hidden layer) is that the total error at the output layer is a function of just one weight and one bias. Thus the total error (or the cost function) is a function of only two independent variables. Therefore, the cost function-surface can (at all) be directly visualized or plotted. For ease of coding, I plotted this function as a contour plot, but a 3D plot also should be possible.

Here are couple of pictures it produces. The input randomly varies in the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept as 0.15.

The following plot shows an intermittent stage during training, a stage when the gradient descent algorithm is still very much in progress.

SImplest network with just input and output layers, with only one neuron per layer: Gradient descent in progress.

The following plot shows when the gradient descent has taken the configuration very close to the local (and global) minimum:

SImplest network with just input and output layers, with only one neuron per layer: Gradient descent near the local (and global) minimum.


Second stage: An n-layer network (with many hidden layers); only one neuron per layer

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.

Here is a representative picture it produces when the total number of layers is 5 (i.e. when there are 3 hidden layers). It was made with both the biases and the weights all being set to the value of 2.0:

It is clearly seen that almost all of learning is limited only to the output layer; the hidden layers have learnt almost nothing!

Now, by way of a contrast, here is what happens when you have all initial biases of 0.0, and all initial weights of 2.0:

Here, the last hidden layer has begun learning enough that it shows some visible change during training, even though the output layer learns much more than does the last hidden layer. Almost all learning is via changes to weights, not biases.

Next, here is the reverse situation: when you have all initial biases of 2.0, but all initial weights of 0.0:

The bias of the hidden layer does undergo a slight change, but in an opposite (positive) direction. Compared to the case just above, the learning now is relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both biases and weights to 0.0.

The network does learn (the difference in the predicted vs. target does go on diminishing as the training progresses). However, the percentage change is too small to be visually registered (when plotted to the same scale as what was used earlier).


The code:

Here is the code which produced all the above plots (but you have to suitably change the hard-coded parameters to get to each of the above cases):

'''
SimplestANNInTheWorld.py
Written by and Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only neuron. Naturally, there is only one target!
-- It even may not have hidden layers. 
-- However, it can have an arbitrary number of hidden layers. This 
feature makes it a good test-bed to see why and how the neurons 
in the hidden layers don't learn much during deep learning, during 
a ``straight-forward'' application of the gradient descent algorithm. 
-- Please do drop me a comment or an email 
if you find this code useful in any way, 
say, in a corporate training setup or 
in academia. Thanks in advance!
-- History:
* 30 December 2018 09:27:57  IST: 
Project begun
* 30 December 2018 11:57:44  IST: 
First version that works.
* 01 January 2019 12:11:11  IST: 
Added visualizations for activations and gradient descent for the 
last layer, for no. of layers = 2 (i.e., no hidden layers). 
* 01 January 2019 18:54:36  IST:
Added visualizations for percent changes in biases and 
weights, for no. of layers >=3 (i.e. at least one hidden layer).
* 02 January 2019 08:40:17  IST: 
The version as initially posted on my blog.
'''
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution, 
    # but rand() returns samples from the uniform distribution: [0,1). 
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):
    adAllBs = np.random.randn( nLayers-1 )
    adAllWs = np.random.randn( nLayers-1 )
    return adAllBs, adAllWs

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):
    adAllBs = np.ndarray( nLayers-1 )
    adAllBs.fill( dB )
    adAllWs = np.ndarray( nLayers-1 )
    adAllWs.fill( dW )
    return adAllBs, adAllWs

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with activation at the output layer. 
# In this script, the target value is always one and the 
# same, i.e., 1.0). 
# Assumes that the form of the cost function is: 
#       C_x = 0.5 * ( dT - dA )^2 
# where, note carefully, that target comes first. 
# Hence the partial derivative is:
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# where note carefully that the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget ) 

def Transpose( dA ):
    np.transpose( dA )
    return dA

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member
    # as compared to l_dAllZs, with the first member being the
    # supplied activation of the input layer 
    l_dAllAs = [ dA ]
    nL = 1
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers
        # to the activation of the current layer (nL) 
        dA = Sigmoid( dZ )  
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash
    dGradB = dDelta
    adAllGradBs[ -1 ] = dGradB

    # Since the last hidden layer has only one neuron, no need to take transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )
    adAllGradWs[ -1 ] = dGradW
    ## print( "\t* Layer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-1, dGradB, dGradW) )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNext = adAllWs[ -nL+1 ]
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta ) 
        dDelta = dDot * dZCurDash
        dGradB = dDelta
        adAllGradBs[ -nL ] = dGradB

        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )
        adAllGradWs[ -nL ] = dGradWCur

        ## print( "\tLayer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-nL, dGradB, dGradW) )

    return ( adAllGradBs, adAllGradWs )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget) 
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )
    
    nLayers = len( l_dAllAs )
    dES = 0.2	# Extra space, in inches
    plt.axis( [-dES, float(nLayers) -1.0 + dES, -dES, 1.0+dES] )
    
    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )
    
    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )
    
def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()
    
    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] ) 
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
    "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB) 
    plt.title( sDesc )
    
    
def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b ) 
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys, # cmap=plt.cm.RdYlBu_r, 
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

def PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget ):
    plt.clf()
    
    nComputeLayers = len( adOrigBs )
    plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )
    
    adBPct = GetPercentDiff( nComputeLayers, adAllBs, adOrigBs )
    adWPct = GetPercentDiff( nComputeLayers, adAllWs, adOrigWs )
    print( "Case: %3d" \
    "\nPercent Changes in Biases:\n%s" \
    "\nPercent Changes in Weights:\n%s\n" \
     % (c, adBPct, adWPct)  )
    adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
    plt.plot( adx + 1.0, adWPct, 'ro' )
    plt.plot( adx + 1.15, adBPct, 'bo' )
    plt.grid()
    plt.xlabel( "Layer Number" )
    plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
    sTitle = "How most learning occurs only at an extreme layer\n" \
    "Percent Changes to Biases and Weights at Each Layer.\n" \
    "Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted) 
    plt.title( sTitle )

def GetPercentDiff( n, adNow, adOrig ):
    adDiff = adNow - adOrig
    print( adDiff )
    adPct = np.zeros( n )
    dSmall = 1.0e-10
    if all( abs( adDiff ) ) > dSmall and all( abs(adOrig) ) > dSmall:
        adPct = adDiff / adOrig * 100.0
    return adPct


################################################################################
# The Main Script
################################################################################

dEta = 1.0 # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases ) #, 0.0 )
adTest = GenerateDataRandom( nTestCases )

np.random.shuffle( adInput )
## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0), 
# i.e., excluding both the end-points of 0.0 and 1.0.

dTarget = 0.15

# The input layer has no biases or weights. Even the output layer 
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a 
# vector.
# For visualization with a 2 layer-network, keep biases and weights 
# between [-4.0, 4.0]

# adAllBs, adAllWs = GenerateBiasesWeightsRandom( nLayers )
adAllBs, adAllWs = GenerateBiasesWeightsConstant( nLayers, 2.0, 2.0 )
dOrigB = adAllBs[-1]
dOrigW = adAllWs[-1]
adOrigBs = adAllBs.copy()
adOrigWs = adAllWs.copy()

## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )
    
# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )
    
    adAllGradBs = [ np.zeros( b.shape ) for b in adAllBs ]
    adAllGradWs = [ np.zeros( w.shape ) for w in adAllWs ]
    
    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    
    # Do the back-propagation
    adAllGradBs, adAllGradWs = BackPropagation( l_dAllZs, l_dAllAs )

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB 
                for dB, dDeltaB in zip( adAllBs, adAllGradBs ) ]
    adAllWs = [ dW - dEta * dDeltaW 
                for dW, dDeltaW in zip( adAllWs, adAllGradWs ) ]
    
    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        dW = adAllWs[ -1 ]
        dB = adAllBs[ -1 ]
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights, Biases for all layers
        # and Activations for all layers 
        dPredicted = l_dAllAs[ -1 ]
        PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget )
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):
    
    dInput = adTest[ c ]
    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )
    
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n" % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )

 


Things you can try:

  • Change one or more of the following parameters, and see what happens:
    • Target value
    • Values of initial weights and biases
    • Number of layers
    • The learning rate, dEta
  • Change the cost function; e.g., try the linear function instead of the Sigmoid. Change the code accordingly.
  • Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…

Have fun!


A song I like:

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”
Music and Singer: C. Ramchandra
Lyrics: Suresh Bhat