Good morning! [LOL!]

In the context of machine learning, “learning” means only one thing: the changes effected in the weights and biases of a network. Nothing more, nothing less.
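For instance, for a single sigmoid neuron, one “unit” of such learning is nothing but an update to one weight and one bias (the input, target, and learning-rate values below are merely illustrative, not taken from the script further down):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target, eta = 0.5, 0.15, 1.0    # illustrative input, target, learning rate
w, b = 2.0, 2.0                    # initial weight and bias

a = sigmoid(w * x + b)             # forward pass
delta = (a - target) * a * (1.0 - a)  # dC/dz for C = 0.5*(t - a)^2
# The "learning" itself: nothing but these two updates.
w, b = w - eta * delta * x, b - eta * delta
```

That is all there is to it; everything else in training is bookkeeping around this step.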

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. The development of the code went through two distinct stages.

**First stage: A two-layer network (without any hidden layer); only one neuron per layer:**

In the first stage, I wrote what is demonstrably the world’s simplest possible artificial neural network. This network has only two layers: the input layer and the output layer. Thus, there is *no hidden layer* at all.

The advantage with such a network (“linear” or “1D”, and without a hidden layer) is that the total error at the output layer is a function of just one weight and one bias. The total error (i.e., the cost function) is thus a function of *only two* independent variables, and so the cost-function surface can be *directly* visualized or plotted. For ease of coding, I plotted this function as a contour plot, but a 3D surface plot should also be possible.
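The tabulation behind such a plot can be sketched as follows (the grid ranges and the sample input here are assumptions, and the script’s own ContourSurface() differs in detail):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 0.15              # assumed sample input and target
b = np.arange(-5.0, 5.0, 0.5)      # grid of biases
w = np.arange(-5.0, 5.0, 0.5)      # grid of weights
W, B = np.meshgrid(w, b)

# Cost C(b, w) = 0.5 * (target - sigmoid(w*x + b))^2 at every grid point;
# the resulting array can be handed to plt.contour(B, W, C) or to a 3D
# plot_surface() call.
C = 0.5 * (target - sigmoid(W * x + B)) ** 2
```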

Here are a couple of pictures the script produces. The input varies randomly over the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept at 0.15.

The following plot shows an intermediate stage during training, when the gradient descent algorithm is still very much in progress.

The following plot shows when the gradient descent has taken the configuration very close to the local (and global) minimum:

**Second stage: An n-layer network (with many hidden layers); only one neuron per layer**

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.

Here is a representative picture it produces when the total number of layers is 5 (i.e., when there are 3 hidden layers). It was made with the biases and the weights all set to the value of 2.0:

It is clearly seen that almost all of the learning is confined to the output layer; the hidden layers have learnt almost nothing!
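The reason can be traced by hand in this single-neuron-per-layer chain: each backward step multiplies the delta by a factor of w times the sigmoid derivative, which stays far below 1 here, so the deltas (and hence the updates) shrink rapidly toward the input. A sketch mirroring the all-2.0 initialization (the sample input x = 0.5 is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, w, b, x, target = 5, 2.0, 2.0, 0.5, 0.15

# Forward pass through the single-neuron chain.
zs, a = [], x
for _ in range(n_layers - 1):
    z = w * a + b
    zs.append(z)
    a = sigmoid(z)

# Backward pass: delta at the output layer first, then each step toward
# the input multiplies delta by w * sigmoid'(z).
delta = (a - target) * a * (1.0 - a)
deltas = [abs(delta)]
for z in reversed(zs[:-1]):
    s = sigmoid(z)
    delta = w * delta * s * (1.0 - s)
    deltas.insert(0, abs(delta))
# deltas now runs from the first hidden layer to the output layer;
# the output-layer value dominates by orders of magnitude.
```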

Now, by way of contrast, here is what happens when all the initial biases are 0.0 and all the initial weights are 2.0:

Here, the last hidden layer has begun learning enough to show some visible change during training, even though the output layer still learns much more than the last hidden layer does. Almost all the learning occurs via changes to the weights, not the biases.

Next, here is the reverse situation: when you have all initial biases of 2.0, but all initial weights of 0.0:

The bias of the hidden layer does undergo a slight change, but in the opposite (positive) direction. Compared to the case just above, the learning is now relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both the biases and the weights to 0.0:

The network does learn (the difference between the predicted and the target values keeps diminishing as training progresses). However, the *percentage* change is too small to register visually (when plotted to the same scale as used earlier).
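As a cross-check on why so little registers in the hidden layers in this case, the very first backprop step can be traced by hand. This sketch (the sample input x = 0.5 is an assumption) mirrors the update rule used in the script: every z is 0, every activation is sigmoid(0) = 0.5, and since a hidden layer's delta is scaled by the (zero) weight of the next layer, every hidden gradient is exactly zero on the first step; only the output layer moves at first.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, target, x = 5, 0.15, 0.5   # x = 0.5 is an assumed sample input
ws = np.zeros(n_layers - 1)
bs = np.zeros(n_layers - 1)

# Forward pass: every z is 0, so every activation is sigmoid(0) = 0.5.
zs, a = [], x
for w, b in zip(ws, bs):
    z = w * a + b
    zs.append(z)
    a = sigmoid(z)

# Output layer: delta = (0.5 - 0.15) * sigmoid'(0) = 0.35 * 0.25
out_delta = (a - target) * a * (1.0 - a)

# Hidden layers: each delta is scaled by the (zero) next-layer weight,
# so every hidden gradient is exactly zero on this first step.
delta, hidden_deltas = out_delta, []
for n in range(2, n_layers):
    s = sigmoid(zs[-n])
    delta = ws[-n + 1] * delta * s * (1.0 - s)
    hidden_deltas.append(delta)
```

Once the output-layer weight becomes non-zero after that first update, the hidden layers do start receiving (tiny) gradients, which is why the network as a whole still learns.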

**The code:**

Here is the code which produced all the above plots (though you have to suitably change the hard-coded parameters to reproduce each of the above cases):

```python
'''
SimplestANNInTheWorld.py
Written by and Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only one neuron. Naturally, there is only one target!
-- It even may not have hidden layers.
-- However, it can have an arbitrary number of hidden layers. This
   feature makes it a good test-bed to see why and how the neurons in
   the hidden layers don't learn much during deep learning, during a
   ``straight-forward'' application of the gradient descent algorithm.
-- Please do drop me a comment or an email if you find this code useful
   in any way, say, in a corporate training setup or in academia.
   Thanks in advance!
-- History:
   * 30 December 2018 09:27:57 IST: Project begun
   * 30 December 2018 11:57:44 IST: First version that works.
   * 01 January 2019 12:11:11 IST: Added visualizations for activations
     and gradient descent for the last layer, for no. of layers = 2
     (i.e., no hidden layers).
   * 01 January 2019 18:54:36 IST: Added visualizations for percent
     changes in biases and weights, for no. of layers >= 3 (i.e., at
     least one hidden layer).
   * 02 January 2019 08:40:17 IST: The version as initially posted on
     my blog.
'''
import numpy as np
import matplotlib.pyplot as plt

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution,
    # but rand() returns samples from the uniform distribution: [0,1).
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):
    adAllBs = np.random.randn( nLayers-1 )
    adAllWs = np.random.randn( nLayers-1 )
    return adAllBs, adAllWs

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):
    adAllBs = np.ndarray( nLayers-1 )
    adAllBs.fill( dB )
    adAllWs = np.ndarray( nLayers-1 )
    adAllWs.fill( dW )
    return adAllBs, adAllWs

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with the activation at the output layer.
# In this script, the target value is always one and the same (dTarget).
# Assumes that the form of the cost function is:
#   C_x = 0.5 * ( dT - dA )^2
# where, note carefully, the target comes first.
# Hence the partial derivative is:
#   \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT )
# where, note carefully, the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget )

def Transpose( dA ):
    # Trivial for the scalars used in this script, but kept for form.
    dA = np.transpose( dA )
    return dA

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member as compared
    # to l_dAllZs, with the first member being the supplied activation
    # of the input layer
    l_dAllAs = [ dA ]
    nL = 1
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers to the
        # activation of the current layer (nL)
        dA = Sigmoid( dZ )
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash
    dGradB = dDelta
    adAllGradBs[ -1 ] = dGradB
    # Since the last hidden layer has only one neuron, there is no real
    # need to take the transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )
    adAllGradWs[ -1 ] = dGradW
    ## print( "\t* Layer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-1, dGradB, dGradW) )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNext = adAllWs[ -nL+1 ]
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta )
        dDelta = dDot * dZCurDash
        dGradB = dDelta
        adAllGradBs[ -nL ] = dGradB
        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )
        adAllGradWs[ -nL ] = dGradWCur
        ## print( "\tLayer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-nL, dGradB, dGradWCur) )
    return ( adAllGradBs, adAllGradWs )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget)
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )
    nLayers = len( l_dAllAs )
    dES = 0.2  # Extra space, in inches
    plt.axis( [-dES, float(nLayers) - 1.0 + dES, -dES, 1.0 + dES] )
    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )
    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )

def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()
    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] )
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
            "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB)
    plt.title( sDesc )

def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b )
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys,  # cmap=plt.cm.RdYlBu_r,
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

def PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs,
                                dPredicted, dTarget ):
    plt.clf()
    nComputeLayers = len( adOrigBs )
    plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )
    adBPct = GetPercentDiff( nComputeLayers, adAllBs, adOrigBs )
    adWPct = GetPercentDiff( nComputeLayers, adAllWs, adOrigWs )
    print( "Case: %3d" \
           "\nPercent Changes in Biases:\n%s" \
           "\nPercent Changes in Weights:\n%s\n" \
           % (c, adBPct, adWPct) )
    adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
    plt.plot( adx + 1.0, adWPct, 'ro' )
    plt.plot( adx + 1.15, adBPct, 'bo' )
    plt.grid()
    plt.xlabel( "Layer Number" )
    plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
    sTitle = "How most learning occurs only at an extreme layer\n" \
             "Percent Changes to Biases and Weights at Each Layer.\n" \
             "Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted)
    plt.title( sTitle )

def GetPercentDiff( n, adNow, adOrig ):
    adDiff = adNow - adOrig
    print( adDiff )
    adPct = np.zeros( n )
    dSmall = 1.0e-10
    if all( abs( adDiff ) > dSmall ) and all( abs( adOrig ) > dSmall ):
        adPct = adDiff / adOrig * 100.0
    return adPct

################################################################################
# The Main Script
################################################################################

dEta = 1.0  # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases )  #, 0.0 )
adTest = GenerateDataRandom( nTestCases )

np.random.shuffle( adInput )
## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0),
# i.e., excluding both the end-points of 0.0 and 1.0.
dTarget = 0.15

# The input layer has no biases or weights. Even the output layer
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a vector.
# For visualization with a 2-layer network, keep biases and weights
# between [-4.0, 4.0]
# adAllBs, adAllWs = GenerateBiasesWeightsRandom( nLayers )
adAllBs, adAllWs = GenerateBiasesWeightsConstant( nLayers, 2.0, 2.0 )
dOrigB = adAllBs[-1]
dOrigW = adAllWs[-1]
adOrigBs = adAllBs.copy()
adOrigWs = adAllWs.copy()
## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )

# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )

    adAllGradBs = [ np.zeros( b.shape ) for b in adAllBs ]
    adAllGradWs = [ np.zeros( w.shape ) for w in adAllWs ]

    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )

    # Do the back-propagation
    adAllGradBs, adAllGradWs = BackPropagation( l_dAllZs, l_dAllAs )

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB
                for dB, dDeltaB in zip( adAllBs, adAllGradBs ) ]
    adAllWs = [ dW - dEta * dDeltaW
                for dW, dDeltaW in zip( adAllWs, adAllGradWs ) ]
    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        dW = adAllWs[ -1 ]
        dB = adAllBs[ -1 ]
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights,
        # Biases for all layers and Activations for all layers
        dPredicted = l_dAllAs[ -1 ]
        PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs,
                                    dPredicted, dTarget )
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):
    dInput = adTest[ c ]
    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n"
           % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )
```

**Things you can try:**

- Change one or more of the following parameters, and see what happens:
  - The target value
  - The values of the initial weights and biases
  - The number of layers
  - The learning rate, dEta
- Change the activation function; e.g., try a linear function instead of the Sigmoid, and change the code accordingly. You could similarly experiment with the form of the cost function.
- Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…
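For that last item, here is a rough starting point (the shapes and the random initialization below are assumptions, not part of the script above): with 2 neurons per layer, each weight becomes a 2×2 matrix and each bias a 2-vector, so the scalar products in the feed-forward turn into matrix products.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_layers = 4

# One 2x2 weight matrix and one 2-vector bias per computing layer.
Ws = [rng.standard_normal((2, 2)) for _ in range(n_layers - 1)]
bs = [rng.standard_normal(2) for _ in range(n_layers - 1)]

a = rng.random(2)              # 2-component input activation
for W, b in zip(Ws, bs):
    a = sigmoid(W @ a + b)     # z = W a + b, applied layer by layer
# a is now the 2-component output activation
```

In the back-propagation, the transposes that are trivial in the 1-neuron case then become genuinely necessary.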

Have fun!

**A song I like:**

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”

Music and Singer: C. Ramchandra

Lyrics: Suresh Bhat