# Data science code snippets—2: Using world’s simplest ANN to visualize that deep learning is hard

Good morning! [LOL!]

In the context of machine learning, “learning” means only this: changes made to the weights and biases of a network. Nothing more, nothing less.

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. Actually, the development of the code went through two distinct stages.

First stage: A two-layer network (without any hidden layer); only one neuron per layer:

In the first stage, I wrote what is demonstrably the world’s simplest possible artificial neural network. This network has only two layers: the input layer and the output layer. Thus, there is no hidden layer at all.

The advantage of such a network (“linear” or “1D”, and without a hidden layer) is that the total error at the output layer is a function of just one weight and one bias. The total error (i.e., the cost function) thus has only two independent variables, so the cost-function surface can be visualized directly. For ease of coding, I plotted this function as a contour plot, but a 3D plot should also be possible.
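To make the "two independent variables" point concrete, here is a minimal sketch that tabulates the quadratic cost over a grid of weight and bias values for one fixed input. The function names, the fixed input of 0.5, and the grid size are assumptions made for illustration; they are not taken from the script further below.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x, target):
    """Quadratic cost 0.5 * (target - a)^2 for a single input x."""
    a = sigmoid(w * x + b)
    return 0.5 * (target - a) ** 2

def cost_grid(x, target, half_width=5.0, n_divs=20):
    """Tabulate the cost on a square grid; rows follow w, columns follow b."""
    step = 2.0 * half_width / n_divs
    axis = [-half_width + i * step for i in range(n_divs + 1)]
    grid = [[cost(w, b, x, target) for b in axis] for w in axis]
    return axis, grid

axis, grid = cost_grid(x=0.5, target=0.15)

# Locate the grid point with the lowest cost: (cost, w, b)
best = min(
    ((grid[i][j], axis[i], axis[j])
     for i in range(len(axis)) for j in range(len(axis))),
    key=lambda t: t[0],
)
print("lowest cost on grid: %.6f at w = %.2f, b = %.2f" % best)
```

Because the whole table depends on only `w` and `b`, it can be handed directly to a contour or surface plotting routine.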

Here are a couple of the pictures it produces. The input varies randomly in the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept at 0.15.

The following plot shows an intermediate stage during training, when the gradient descent algorithm is still very much in progress.

[Figure: Simplest network with just input and output layers, with only one neuron per layer: gradient descent in progress.]

The following plot shows the stage when gradient descent has taken the configuration very close to the local (and global) minimum.

[Figure: Simplest network with just input and output layers, with only one neuron per layer: gradient descent near the local (and global) minimum.]
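The descent shown in these two plots can be imitated with a few lines of plain Python. This is only a sketch under stated assumptions: the starting point (weight 2.0, bias 2.0), the random seed, and the function name `train_two_layer` are hypothetical choices, while the target of 0.15, the learning rate of 1.0, and the 100 training cases mirror the values used in the script below.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_two_layer(w, b, target=0.15, eta=1.0, n_cases=100, seed=0):
    """Per-case gradient descent on one weight and one bias."""
    rng = random.Random(seed)
    path = [(b, w)]                      # visited (bias, weight) points
    for _ in range(n_cases):
        x = rng.random()                 # uniform in [0.0, 1.0)
        a = sigmoid(w * x + b)
        # dC/dz for C = 0.5 * (target - a)^2 with a sigmoid output
        delta = (a - target) * a * (1.0 - a)
        w -= eta * delta * x             # dC/dw = delta * x
        b -= eta * delta                 # dC/db = delta
        path.append((b, w))
    return w, b, path

w, b, path = train_two_layer(w=2.0, b=2.0)
print("final weight: %.4f, final bias: %.4f" % (w, b))
print("prediction for x = 0.5: %.4f" % sigmoid(w * 0.5 + b))
```

The list `path` holds the successive (bias, weight) points, which is what one would overlay on a contour plot of the cost.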

Second stage: An n-layer network (with many hidden layers); only one neuron per layer

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.
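A short derivation suggests why this should be expected for a chain with one neuron per layer. The symbols below ($z^{l}$, $a^{l}$, $\delta^{l}$, with the input layer numbered 1 and the output layer numbered $L$) are introduced here for illustration; the cost is the same quadratic form the script assumes.

```latex
\begin{aligned}
z^{l} &= w^{l} a^{l-1} + b^{l}, \qquad a^{l} = \sigma(z^{l}), \qquad
C_x = \tfrac{1}{2}\,(t - a^{L})^{2} \\
\delta^{L} &= \frac{\partial C_x}{\partial z^{L}} = (a^{L} - t)\,\sigma'(z^{L}) \\
\delta^{l} &= w^{l+1}\,\delta^{l+1}\,\sigma'(z^{l}), \qquad l = L-1, \dots, 2 \\
\frac{\partial C_x}{\partial b^{l}} &= \delta^{l}, \qquad
\frac{\partial C_x}{\partial w^{l}} = \delta^{l}\,a^{l-1} \\
\sigma'(z) &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}
\;\;\Longrightarrow\;\;
\lvert \delta^{l} \rvert \le \frac{\lvert w^{l+1} \rvert}{4}\,\lvert \delta^{l+1} \rvert
\end{aligned}
```

Each step back towards the input multiplies the error signal by $w^{l+1}\sigma'(z^{l})$. Unless the weights are large, that factor is below 1, and it is far below 1 when a neuron is saturated, so the gradients shrink layer by layer.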

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.
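As a rough, self-contained sketch of this second stage (not the script itself), the following trains a chain with one neuron per layer and reports the percent change of each bias and weight. The names `feed_forward`, `back_propagation` and `percent_changes`, the seed, and the use of plain lists are assumptions for illustration. The percent change is undefined for a zero initial value, so this sketch requires non-zero `w0` and `b0`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def feed_forward(x, weights, biases):
    """Return the weighted inputs and the activations of every layer."""
    zs, activations = [], [x]
    for w, b in zip(weights, biases):
        zs.append(w * activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def back_propagation(zs, activations, weights, target):
    """Return per-layer gradients of 0.5 * (target - a_out)^2."""
    n = len(weights)
    grad_b, grad_w = [0.0] * n, [0.0] * n
    # Output layer
    delta = (activations[-1] - target) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta
    grad_w[-1] = delta * activations[-2]
    # Hidden layers, walking back towards the input
    for l in range(n - 2, -1, -1):
        delta = weights[l + 1] * delta * sigmoid_prime(zs[l])
        grad_b[l] = delta
        grad_w[l] = delta * activations[l]
    return grad_b, grad_w

def percent_changes(inputs, n_layers=5, w0=2.0, b0=2.0, target=0.15, eta=1.0):
    """Train once over the inputs; w0 and b0 must be non-zero."""
    weights = [w0] * (n_layers - 1)
    biases = [b0] * (n_layers - 1)
    for x in inputs:
        zs, activations = feed_forward(x, weights, biases)
        grad_b, grad_w = back_propagation(zs, activations, weights, target)
        weights = [w - eta * g for w, g in zip(weights, grad_w)]
        biases = [b - eta * g for b, g in zip(biases, grad_b)]
    pct_b = [100.0 * (b - b0) / b0 for b in biases]
    pct_w = [100.0 * (w - w0) / w0 for w in weights]
    return pct_b, pct_w

rng = random.Random(0)
pct_b, pct_w = percent_changes([rng.random() for _ in range(100)])
for layer, (pb, pw) in enumerate(zip(pct_b, pct_w), start=2):
    print("layer %d: bias %+.4f%%, weight %+.4f%%" % (layer, pb, pw))
```

Layer numbers in the printout start at 2 because the input layer (layer 1) has no weight or bias of its own.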

Here is a representative picture it produces when the total number of layers is 5 (i.e., when there are 3 hidden layers). It was made with all the biases and all the weights initialized to 2.0. It is clearly seen that almost all of the learning is limited to the output layer; the hidden layers have learnt almost nothing!

Now, by way of contrast, here is what happens when all the initial biases are 0.0 and all the initial weights are 2.0. Here, the last hidden layer learns enough to show some visible change during training, even though the output layer learns much more than the last hidden layer does. Almost all the learning is via changes to weights, not biases.

Next, here is the reverse situation, with all initial biases of 2.0 but all initial weights of 0.0. The bias of the hidden layer does undergo a slight change, but in the opposite (positive) direction. Compared to the case just above, the learning is now relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both biases and weights to 0.0. The network does learn (the difference between the predicted and the target values keeps diminishing as the training progresses). However, the percentage change is too small to register visually (when plotted to the same scale as was used earlier).
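The four initializations above can also be compared numerically. The sketch below is an illustration rather than part of the original script: it prints the raw gradients on the very first training case instead of percent changes, because a percent change cannot be computed from a zero initial value. The function name `first_step_gradients` and the fixed input of 0.5 are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def first_step_gradients(w0, b0, n_layers=5, x=0.5, target=0.15):
    """Gradients of 0.5 * (target - a_out)^2 for one input, before any update."""
    n = n_layers - 1                      # layers that own a weight and a bias
    activations = [x]
    for _ in range(n):
        activations.append(sigmoid(w0 * activations[-1] + b0))
    grad_b, grad_w = [0.0] * n, [0.0] * n
    a_out = activations[-1]
    delta = (a_out - target) * a_out * (1.0 - a_out)
    for l in range(n - 1, -1, -1):
        if l < n - 1:
            a = activations[l + 1]
            # Every weight equals w0, including the one feeding the next layer
            delta = w0 * delta * a * (1.0 - a)
        grad_b[l] = delta
        grad_w[l] = delta * activations[l]
    return grad_b, grad_w

for w0, b0 in [(2.0, 2.0), (2.0, 0.0), (0.0, 2.0), (0.0, 0.0)]:
    grad_b, grad_w = first_step_gradients(w0, b0)
    print("w0 = %.1f, b0 = %.1f" % (w0, b0))
    print("  dC/db per layer:", ["%.2e" % g for g in grad_b])
    print("  dC/dw per layer:", ["%.2e" % g for g in grad_w])
```

In this sketch, zero initial weights give the hidden layers exactly zero gradient on the first case, since the back-propagated error is multiplied by the next layer's weight. Once the output weight has moved away from zero, an error signal can reach the hidden layers on later cases.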

The code:

Here is the code which produced all the above plots (but you have to suitably change the hard-coded parameters to get to each of the above cases):

'''
SimplestANNInTheWorld.py
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only one neuron. Naturally, there is only one target!
-- It even may not have hidden layers.
-- However, it can have an arbitrary number of hidden layers. This
feature makes it a good test-bed to see why and how the neurons
in the hidden layers don't learn much during deep learning, during
a "straight-forward" application of the gradient descent algorithm.
-- Please do drop me a comment or an email
if you find this code useful in any way,
say, in a corporate training setup or
-- History:
* 30 December 2018 09:27:57  IST:
Project begun
* 30 December 2018 11:57:44  IST:
First version that works.
* 01 January 2019 12:11:11  IST:
last layer, for no. of layers = 2 (i.e., no hidden layers).
* 01 January 2019 18:54:36  IST:
Added visualizations for percent changes in biases and
weights, for no. of layers >= 3 (i.e. at least one hidden layer).
* 02 January 2019 08:40:17  IST:
The version as initially posted on my blog.
'''
import numpy as np
import matplotlib.pyplot as plt

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution,
    # but rand() returns samples from the uniform distribution: [0,1).
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with activation at the output layer.
# In this script, the target value is always one and the
# same for every training case (see dTarget in the main script).
# Assumes that the form of the cost function is:
#       C_x = 0.5 * ( dT - dA )^2
# where, note carefully, the target comes first.
# Hence the partial derivative is:
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT )
# where, note carefully, the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget )

def Transpose( dA ):
    # For the scalars used in this 1D network, this is a no-op
    return np.transpose( dA )

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member
    # as compared to l_dAllZs, with the first member being the
    # supplied activation of the input layer
    l_dAllAs = [ dA ]
    nL = 1
    # Uses the network-wide weights and biases set up in the main script
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers
        # to the activation of the current layer (nL)
        dA = Sigmoid( dZ )
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash

    # Since the last hidden layer has only one neuron, no need to take transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta )
        dDelta = dDot * dZCurDash

        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget)
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )

    nLayers = len( l_dAllAs )
    dES = 0.2   # Extra space, in inches
    plt.axis( [-dES, float(nLayers) -1.0 + dES, -dES, 1.0+dES] )

    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )

    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )

def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()

    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] )
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
            "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB)
    plt.title( sDesc )

def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b )
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys, # cmap=plt.cm.RdYlBu_r,
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

plt.clf()

plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )

print( "Case: %3d" \
"\nPercent Changes in Biases:\n%s" \
"\nPercent Changes in Weights:\n%s\n" \
adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
plt.grid()
plt.xlabel( "Layer Number" )
plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
sTitle = "How most learning occurs only at an extreme layer\n" \
"Percent Changes to Biases and Weights at Each Layer.\n" \
"Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted)
plt.title( sTitle )

dSmall = 1.0e-10
if all( abs( adDiff ) > dSmall ) and all( abs( adOrig ) > dSmall ):

################################################################################
# The Main Script
################################################################################

dEta = 1.0 # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases ) #, 0.0 )

## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0),
# i.e., excluding both the end-points of 0.0 and 1.0.

dTarget = 0.15

# The input layer has no biases or weights. Even the output layer
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a
# vector.
# For visualization with a 2 layer-network, keep biases and weights
# between [-4.0, 4.0]

## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )

# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )

    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )

    # Do the back-propagation

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB
    adAllWs = [ dW - dEta * dDeltaW

    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights, Biases for all layers
        # and Activations for all layers
        dPredicted = l_dAllAs[ -1 ]
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):

    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )

    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n" % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )



Things you can try:

• Change one or more of the following parameters, and see what happens:
    • Target value
    • Values of initial weights and biases
    • Number of layers
    • The learning rate, dEta
• Change the activation function; e.g., try a linear function instead of the Sigmoid. Change the code (including the derivative used in back-propagation) accordingly.
• Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…

Have fun!

A song I like:

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”
Music and Singer: C. Ramchandra
Lyrics: Suresh Bhat
