# Data science code snippets—1: My first ANN (a copy of Nielsen’s)

I took the liberty of re-typing Nielsen’s code, in the process introducing the Hungarian naming convention for the variables, and also many other non-Pythonic and non-Nielsenic whimsies. Here is the end-result, for whatever it is worth. (Its sole intended purpose was to run the feed-forward and back-propagation algorithms in an IDE.) Tested only lightly on Ubuntu 18.04.1 LTS (Bionic Beaver) 64-bit + Python 3.7.0 64-bit in conda 4.5.11 + MS VSCode 1.30.1.

It works. (It has to; except for the cosmetic changes, it’s basically Nielsen’s code.) It produces this graph as output:

(Figure: Total Errors vs. Epochs)

“Enjoy!”

'''
ANN.Nielsen.SmallerAndEasiestDataVectors.py
-- Simple Artificial Neural Network.
-- This code very closely follows "network.py" given in Nielsen's book
   (chapter 1). It was modified a bit by Ajit R. Jadhav.
-- Meant for developing understanding of ANNs through debugging in an IDE
   (with the ability to inspect values live at run-time that also can be
   verified manually, by taking *smaller* and *simpler* vectors for the input
   and target data-sets).
-- Includes functions for generation of small and simple 1D input vectors
   and their corresponding target vectors. The input data generated here is,
   effectively, just a randomized version of Nielsen's "vectorized_result"
   function.
-- Other features of this code:
   * Writes the total error for each epoch to a file.
   * Plots the total error vs. the epoch number.
   * Follows the Hungarian naming convention. However, the variable names
     may not reflect the best possible choices; this code was written in a hurry.
-- Do whatever you like with it, but 100% on your own responsibility.
-- But if you use this code for academic or corporate training purposes,
   then please do let me know by an email or a (confidential) comment here.
-- History
   * Begun: 23 Dec. 2018
   * Completed first version that works: 26 Dec. 2018
   * This version: Sunday 30 December 2018 11:54:52 IST
'''
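
# A rough legend for the Hungarian-style prefixes, as inferred from the code
# below: "n" = integer, "d" = double, "s" = string, "md" = matrix of doubles
# (an (N,1) or (M,N) ndarray), "amd" = an array/list of such matrices,
# "l_" = a plain Python list, "fo" = a file object, and "m_" = a class data
# member. Thus, e.g., "m_amdAllWs" is a data member holding all the weight
# matrices.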

import random
import numpy as np
import matplotlib.pyplot as plt

################################################################################
# Global helper functions
################################################################################

################################################################################
# Helper function. Uses numpy.random.rand() to create inputs and targets,
# whether for training or for testing. Useful for very simple 1D studies only.
# (The input data as generated here very easily predict the targets! Even just
# a single call to the softmax function would have been sufficient!!)
def GenerateData( nNeursInInput, nLenOPLayer, nCasesPerTarget ):
    # Python lists for inputs and targets, returned at the end by this function.
    l_mdInput = []
    l_mdTarget = []
    for t in range( nLenOPLayer ):
        for i in range( nCasesPerTarget ):
            mdInput = np.random.rand( nNeursInInput, 1 )*0.1
            mdInput[t][0] = mdInput[t][0] + np.random.random_sample()*0.1 + 0.8
            l_mdInput.append( mdInput )

            mdTarget = np.zeros( (nLenOPLayer,1) )
            mdTarget[t][0] = 1.0
            l_mdTarget.append( mdTarget )
    # pickle returns data as arrays; let's simulate the same thing here
    amdInput = np.array( l_mdInput )
    amdOutput = np.array( l_mdTarget )
    # Python 3(+?) thing. Convert the zip to a list either here or later.
    # Better to do it here.
    Data = list( zip( amdInput, amdOutput ) )
    return Data
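
# For instance, GenerateData( 10, 10, 1 ) returns a list of 10 (input, target)
# tuples; each input is a (10,1) column vector with entries below 0.1, except
# for a single dominant component lying in [0.8, 1.0), and each target is the
# corresponding one-hot (10,1) vector.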

################################################################################
# Helper function. Computes the sigmoid activation function
def Sigmoid( mdZ ):
    return 1.0 / ( 1.0 + np.exp( - mdZ ) )

################################################################################
# Helper function. Computes the derivative of the sigmoid activation function
def SigmoidDerivative( mdZ ):
    mdA = Sigmoid( mdZ )
    mdADer = mdA * ( 1.0 - mdA )
    return mdADer
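
# (A quick check of the formula: with A = 1/(1 + exp(-Z)), differentiating
#  w.r.t. Z gives exp(-Z)/(1 + exp(-Z))^2, which simplifies to A*(1 - A),
#  i.e., exactly the product returned above.)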

################################################################################
# Helper function. Called with a single activation vector with its
# target vector. Assumes that the form of the cost function for each
# neuron is (note carefully that target comes first):
#       C_x = 0.5 * ( dT - dA )^2
# so that
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT )
# (where note that the activation comes first).
def CostDerivative( mdA_OP, mdT ):
    return ( mdA_OP - mdT )

################################################################################
# Class Network begins here
################################################################################

class Network(object):

    ############################################################################
    # Constructor.
    # Supply an array containing number of neurons for each layer,
    # e.g., [5,4,3]
    def __init__(self, anNIL):
        self.m_nLayers = len( anNIL )
        self.m_anNIL = np.array( anNIL )
        # anNIL[1:] means: An array with all entries except the first
        # anNIL[:-1] means: An array with all entries except for the last
        # We allocate (N,1)-shaped matrices rather than (N)-shaped vectors
        # because later on in back-propagation, to have a vectorized code,
        # we perform np.dot() in its tensor-product-like avatar, and there,
        # we need both the operands to be matrices.
        self.m_amdAllBs = [ np.random.randn( nNeursCur, 1 )
                            for nNeursCur in anNIL[1:] ]
        self.m_amdAllWs = [ np.random.randn( nNeursCur, nNeursPrev )
                            for nNeursPrev, nNeursCur in
                            zip( anNIL[:-1], anNIL[1:] ) ]
        pass
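
    # For example, with anNIL = [5,4,3] as above: m_amdAllBs holds two
    # matrices of shapes (4,1) and (3,1), and m_amdAllWs holds two matrices
    # of shapes (4,5) and (3,4), i.e., (no. of neurons in the current layer)
    # x (no. of neurons in the previous layer). The IP layer gets no weights
    # or biases.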

    ############################################################################
    # For each (input, target) tuple representing a row in the
    # mini-batch, perform the training. We compute the changes (delta's)
    # to be effected to the biases and weights through feed-forward and
    # back-propagation, add them up until the mini-batch is over, and then
    # apply them to update the weights and biases only once, at the end of
    # this function. Unlike in Nielsen's code, here we have a separate
    # function each for feed-forward and back-propagation.
    def UpdateMiniBatch( self, dEta, nMiniBatchSize, MiniBatch ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't directly
        # use np.zeros() because these matrices are not of uniform sizes;
        # the no. of neurons in each layer is arbitrarily different.
        amdAllGradBsMB = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWsMB = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Python list of the error vectors, one per row of the mini-batch
        l_mdErrorsMB = []

        # For each (input, target) tuple representing a row in the
        # mini-batch, perform the training.
        for x, y in MiniBatch:

            # Feed-Forward Pass: Feed the next input vector to the network
            # using the existing weights and biases, and find the net-input
            # and activations it produces at all the network layers.
            l_mdAllZs, l_mdAllAs = self.FeedForward( x )

            # Compute the error vector
            mdDiff = y - l_mdAllAs[ -1 ]
            mdErrors = 0.5 * mdDiff * mdDiff
            l_mdErrorsMB.append( mdErrors )

            # Back-Propagation Pass: Back-propagate the change in the total cost
            # function (total error) to the local gradients to be applied to
            # each weight (of a synaptic connection between the current and the
            # previous layers), and to each bias (of the current layer).
            amdAllGradBs, amdAllGradWs = self.BackProp( l_mdAllAs, l_mdAllZs, y )

            # Add the changes to the local copies of the biases and weights.
            # The zip function takes iterable elements as input, and returns an
            # iterator for the tuples. Thus, zipping here makes it iterate
            # through each layer of the network.
            amdAllGradBsMB = [ mdAccB + mdGradB for mdAccB, mdGradB in
                               zip( amdAllGradBsMB, amdAllGradBs ) ]
            amdAllGradWsMB = [ mdAccW + mdGradW for mdAccW, mdGradW in
                               zip( amdAllGradWsMB, amdAllGradWs ) ]

        # Processing for all the rows in the current mini-batch is now over.
        # Now, update the network data-members from the local copies.

        # Note, we take an average of the changes over all the rows.
        dEtaAvgOverMB = dEta / nMiniBatchSize

        self.m_amdAllBs = [ mdB - dEtaAvgOverMB * mdGradB
                            for mdB, mdGradB in
                            zip( self.m_amdAllBs, amdAllGradBsMB ) ]
        self.m_amdAllWs = [ mdW - dEtaAvgOverMB * mdGradW
                            for mdW, mdGradW in
                            zip( self.m_amdAllWs, amdAllGradWsMB ) ]

        # Return the error vectors for this mini-batch
        return l_mdErrorsMB
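
    # In effect, the update applied at the end of UpdateMiniBatch() above is
    # the usual mini-batch SGD rule (with m = nMiniBatchSize):
    #     W  <--  W - (dEta/m) * sum-over-the-mini-batch( GradW )
    #     B  <--  B - (dEta/m) * sum-over-the-mini-batch( GradB )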

    ############################################################################
    # Called for a single input vector (i.e., "x" in Nielsen's code)
    def FeedForward( self, mdA ):

        # Python list of all activations, layer-by-layer
        l_mdAllAs = [ mdA ]
        # Python list for z vectors, layer-by-layer
        l_mdAllZs = []

        # For the weight and bias matrices of each layer...
        for mdW, mdB in zip( self.m_amdAllWs, self.m_amdAllBs ):
            # Compute the net input vector to activate
            # the neurons of this layer
            mdZ = np.dot( mdW, mdA ) + mdB
            l_mdAllZs.append( mdZ )

            # Compute the activations for all the neurons
            # of this layer
            mdA = Sigmoid( mdZ )
            l_mdAllAs.append( mdA )

        return ( l_mdAllZs, l_mdAllAs )
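
    # In matrix terms, the loop in FeedForward() above computes, layer by
    # layer: mdZ_l = np.dot( mdW_l, mdA_{l-1} ) + mdB_l, followed by
    # mdA_l = Sigmoid( mdZ_l ), with mdA_0 being the input vector itself.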

    ############################################################################
    # Called with the activations and net inputs for all the layers,
    # i.e., for the entire network in one go. Returns the gradients of the
    # cost w.r.t. all the biases and weights.
    def BackProp( self, l_mdAllAs, l_mdAllZs, mdTarget ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't use
        # np.zeros() directly because these matrices are not of uniform sizes;
        # the no. of neurons in each layer is arbitrarily different.
        amdAllGradBs = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Back-propagation occurs in two distinct stages:
        # (i) In the first stage, we begin from the end (the output layer), and
        # use the layer just before it, in order to update the biases and
        # weights of the connections lying between the two (which, in this
        # class, are stored at the OP layer's index).
        # (ii) In the second stage, we iterate back successively through the
        # network, starting from the last hidden layer, up to the first hidden
        # layer.
        # The split-up into two stages is necessary because the activation of a
        # neuron from the input- or hidden-layers flows through many (all)
        # neurons in the immediately next layer, and the total cost function
        # (i.e. the total error) is affected by *all* these paths. In contrast,
        # the output layer has no next layer, and so, the activation of a
        # neuron in the output layer affects the total cost function only
        # through its own connection to the total cost, not through those of
        # the other neurons. So, the Delta at the OP layer can be directly
        # computed from a knowledge of the OP layer activations (predictions)
        # and the prescribed targets. But to compute the Delta at an
        # intermediate layer, we do need the Delta for the immediately
        # next layer.
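
        # In symbols, the two stages below compute the standard set of
        # back-propagation relations (with "*" denoting the element-wise
        # product and ".T" a transpose):
        #     Delta_OP  = ( A_OP - Target ) * SigmoidDerivative( Z_OP )
        #     Delta_l   = np.dot( W_{l+1}.T, Delta_{l+1} ) * SigmoidDerivative( Z_l )
        #     GradB_l   = Delta_l
        #     GradW_l   = np.dot( Delta_l, A_{l-1}.T )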

        ##########
        # Stage I: Compute mdGradB and mdGradW for the OP layer.

        # The OP layer is the last, accessed through the -1 index. Get the
        # activations (A's) on its output side, and the net inputs (Z's) on its
        # input side.
        mdA_OP = l_mdAllAs[ -1 ]
        mdZ_OP = l_mdAllZs[ -1 ]

        # Compute the partial derivatives, and use them to find the Grad for
        # all B's in the OP layer.
        mdZ_OP_Der = SigmoidDerivative( mdZ_OP )
        # As an intermediate product, find the delta for the OP layer. It is
        # subsequently used in the second stage as an input, thereby
        # back-propagating the changes to the total cost function back
        # throughout the network.
        mdDelta = CostDerivative( mdA_OP, mdTarget ) * mdZ_OP_Der
        # Compute the Grad for all B's in the OP layer
        amdAllGradBs[ -1 ] = mdDelta

        # Compute the partial derivatives, and find the Grad for all W's in
        # the connections coming into the OP layer. Index -2 means: the layer
        # just before the OP layer. We use "LHL" to mean the last hidden layer.
        mdA_LHL_Transpose = l_mdAllAs[ -2 ].transpose()
        # Here, we build the GradW *matrix* from what essentially are two
        # vectors (or (N,1) matrices). Even if the numpy function says
        # "dot", it actually works like the tensor product---it outputs an
        # (N X N) matrix, not a single number as a scalar. It is for this step
        # that we made even B's an (N,1) shaped matrix in the constructor.
        amdAllGradWs[ -1 ] = np.dot( mdDelta, mdA_LHL_Transpose )

        ###########
        # Stage II: Compute mdGradB and mdGradW for the last hidden layer,
        #           and go back, all the way up to the first hidden layer.

        # Note that only the negative of the index ever gets used in the
        # following loop.
        # a[-2] means the second-last layer (i.e. the last hidden layer).
        # a[-1] means the last layer (i.e., the OP layer).
        # a[-2-1] means the layer just before the second-last layer.
        # The loop index -nL denotes the current layer, starting from the last
        # hidden layer; the mdDelta on the RHS always belongs to the layer
        # immediately next to it (i.e., the one closer to the OP layer).
        # If the indexing gets confusing, notice that Nielsen uses the range
        # (2, nLayers), rather than that of (1, nLayers-1) which would have
        # been a bit more natural, at least to me. But of course, his choice
        # is slightly more consistent from a mathematical viewpoint---and
        # from the fact that there are no weights and biases for the IP
        # layer.
        for nL in range( 2, self.m_nLayers ):
            # Find the mdDelta for the current layer, using that for the
            # next layer...
            mdZCur = l_mdAllZs[ -nL ]
            mdZCurDer = SigmoidDerivative( mdZCur )
            mdWNextTrans = self.m_amdAllWs[ -nL+1 ].transpose()
            # Note, for the very first pass through the loop, the mdDelta being
            # used on the RHS is what was earlier computed in Stage I given
            # above (viz. for the output layer). It now gets updated here, so
            # that it can act as the next layer's mdDelta in the next pass
            # through this loop.
            mdDelta = np.dot( mdWNextTrans, mdDelta ) * mdZCurDer
            # mdDelta now refers to the current layer, not to the next

            amdAllGradBs[ -nL ] = mdDelta

            mdAPrevTrans = l_mdAllAs[ -nL-1 ].transpose()
            amdAllGradWs[ -nL ] = np.dot( mdDelta, mdAPrevTrans )

            # Explaining the above few lines is left as an exercise
            # for the reader. (Hints: (i) Write out the matrix equation
            # "straight," i.e., as used in the feed-forward pass. (ii)
            # Use the rules for the matrix multiplications and the rules
            # for taking transposes.)

        # Return the gradients for all the layers
        return ( amdAllGradBs, amdAllGradWs )

    ############################################################################
    # Implements the stochastic gradient descent algorithm.
    # Note: dEta (the learning rate) is picked up as the module-level global
    # defined in the main script below.
    def SGD( self, sErrorsFileName, nEpochs,
             TrainingData, nMiniBatchSize, TestData ):

        # These lists are used only for plotting the total activation errors
        # at the OP layer: the epoch numbers go on the x-axis, and the
        # epoch-averaged errors go on the y-axis.
        l_nEpochs = []
        l_dAvgErrorsEpochs = []

        # To have a small file-size, we persist the total error for the
        # output layer only as averaged over the entire epoch (i.e., the
        # entire training data, randomly shuffled).
        foErrors = open( sErrorsFileName, 'w' )

        nTrainingCases = len( TrainingData )

        # An epoch is defined for the entire training data. However, each epoch
        # begins by randomly shuffling the training data. Effectively, we
        # change the initial position from which to begin descending down the
        # higher-dimensional total-cost-function surface.
        for e in range( nEpochs ):

            # Create Mini-Batches covering the entire training data
            np.random.shuffle( TrainingData )
            MiniBatches = [ TrainingData[ k : k + nMiniBatchSize ]
                            for k in range( 0, nTrainingCases, nMiniBatchSize ) ]

            dTotalErrorEpoch = 0
            # For each mini-batch
            for MiniBatch in MiniBatches:
                # Conduct the training over the entire mini-batch,
                # and collect the errors accumulated over it. Add them
                # to the accumulated error for the current epoch.
                l_mdErrorsMB = self.UpdateMiniBatch( dEta, nMiniBatchSize, MiniBatch )
                dAvgErrorMB = self.ComputeAvgErrorForMiniBatch( l_mdErrorsMB )
                dTotalErrorEpoch = dTotalErrorEpoch + dAvgErrorMB

            # Average of the mini-batch averages, taken over all the
            # mini-batches in this epoch
            dAvgErrorEpoch = dTotalErrorEpoch / len( MiniBatches )
            l_nEpochs.append( e )
            l_dAvgErrorsEpochs.append( dAvgErrorEpoch )

            # Write to file
            sLine = "%E\n" % (dAvgErrorEpoch)
            foErrors.write( sLine )

            # self.Evaluate( TestData )
            print( "Epoch %d: Avg. Error: %lf" % (e, dAvgErrorEpoch) )

        foErrors.close()

        plt.plot( l_nEpochs, l_dAvgErrorsEpochs )
        plt.xlabel( "Epochs" )
        plt.ylabel( "Total Error at the OP Layer\n(Epoch Avg. of Mini-Batch Avgs.)" )
        plt.title( "Training Errors" )
        plt.show()

    ############################################################################
    # Helper function
    def ComputeAvgErrorForMiniBatch( self, l_mdErrorsMB ):
        nSize = len( l_mdErrorsMB )
        dTotalErrorMB = 0
        # For each training case in this mini-batch
        for i in range( nSize ):
            # Get the error vector for the i-th training case
            mdE = l_mdErrorsMB[ i ]
            dTotalErrorCase = mdE.sum()
            dTotalErrorMB = dTotalErrorMB + dTotalErrorCase
        # The average of the total errors for all cases in the current mini-batch
        dAvgErrorMB = dTotalErrorMB / nSize
        return dAvgErrorMB
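
    # In other words, the number returned above is
    #     (1/nSize) * sum-over-cases( sum-over-OP-neurons( 0.5*(T - A)^2 ) ),
    # i.e., the total squared error at the OP layer, averaged over the rows
    # of the mini-batch.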

    ############################################################################
    # This function is mostly useless in so simple a scenario; it predicts
    # 100 % accuracy right from the 1st epoch! Obvious. With the kind of input
    # data we generate, the prediction would have been 100 % accurate even
    # with a single function call to softmax!
    #
    # def Evaluate( self, TestData ):
    #     nTestCases = len( TestData )
    #     anPredictedRes = []
    #     for x, y in TestData:
    #         l_mdAllZs, l_mdAllAs = self.FeedForward( x )
    #
    #         mdActivOP = l_mdAllAs[ -1 ]
    #         nPredIdx = np.argmax( mdActivOP )
    #         nTargetIdx = np.argmax( y )
    #         anPredictedRes.append( int( nPredIdx == nTargetIdx ) )
    #     print( anPredictedRes )
    #     pass
################################################################################
# Class Network ends here
################################################################################

################################################################################
# main script
################################################################################

# Prepare Input Data

# The learning rate
dEta = 0.5

# Number of Neurons contained In each successive Layer
anNIL = [10,10,10]

# Size of the input layer
nNeursInInput = anNIL[ 0 ]
# Size of the target layer
nTargets = anNIL[ -1 ]
# Total number of cases to have for training
nTotalNumCasesForTraining = 10000
# Total number of cases to have for testing
nTotalNumCasesForTesting = nTotalNumCasesForTraining // 5

# Number of cases to generate for each target for training
nCasesPerTargetForTraining = nTotalNumCasesForTraining // nTargets
# Number of cases to generate for each target for testing
nCasesPerTargetForTesting = nTotalNumCasesForTesting // nTargets

# For repeatability, seed the RNG in NumPy. For "true" random-ness
# at each program invocation, comment out the next line. Notice that
# seeding, if any, must occur before generating data.
np.random.seed( 0 )

TrainingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTraining )
TestingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTesting )

# Number of Epochs
nEpochs = 10
# Mini-Batch Size
nMiniBatchSize = 10

# Instantiate the network
# Optionally, we can have a re-seeding also here
# np.random.seed( 10000 )
nn = Network( anNIL )
# Let the network find the optimum combination of values
# for all the weights and biases it contains.
nn.SGD( "EpochErrors.txt", nEpochs, TrainingData, nMiniBatchSize, TestingData )

print( "Done!" )



Things for you to add and/or to try:

• Add code to persist the network (the weights and biases) to a plain-text CSV file, and to re-initialize the network from that file. (A minimal sketch follows this list.)
• Add code to show a simple (real-time) diagrammatic visualization of the training process for a tiny (say [4,3,2]) network, using colormap-based shades for the lines representing the synaptic weights, and similarly colormap-based circles behind the neurons for the biases.
• Write a more ambitious generator of n-dimensional input data. It would have `m’ islands of volumes (of the simplest topology) of randomly varying dark shades, embedded in a randomly varying background of light shades. That is, it would be an n-dimensional generalization of taking the Windows Paint (or a similar) program, filling the canvas with a non-uniform but light-gray color, and placing dark-shaded blobs of arbitrary shapes randomly in it, rather like a two-phase material’s micro-structure. Take such a configuration of blobs and introduce small random variations into its shape and position too. Then take this small group of variations and assign a target value to it. Repeat for N targets.
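
For the first item, here is one minimal sketch of a possible approach. It assumes the `Network` class from the listing above; the helper names (`SaveNetworkToCsv`, `LoadNetworkFromCsv`) and the CSV layout (layer sizes on the first line, then one flattened bias vector per line, then one flattened weight matrix per line) are only illustrative choices, not the only way to do it.

# A minimal, illustrative sketch for persisting/restoring the network to/from
# a plain-text CSV file. Assumes the Network class defined in the listing
# above; the function names here are only suggestions.
import csv
import numpy as np

def SaveNetworkToCsv( nn, sFileName ):
    with open( sFileName, 'w', newline='' ) as f:
        writer = csv.writer( f )
        # First line: the layer sizes, so that loading can re-create the shapes
        writer.writerow( nn.m_anNIL )
        # Then one line per bias vector, followed by one line per weight
        # matrix, each flattened in C (row-major) order
        for mdB in nn.m_amdAllBs:
            writer.writerow( mdB.flatten() )
        for mdW in nn.m_amdAllWs:
            writer.writerow( mdW.flatten() )

def LoadNetworkFromCsv( sFileName ):
    with open( sFileName, 'r', newline='' ) as f:
        l_Rows = [ row for row in csv.reader( f ) ]
    # Re-create a network with the persisted layer sizes, then overwrite its
    # randomly initialized biases and weights with the persisted values
    anNIL = [ int( s ) for s in l_Rows[ 0 ] ]
    nn = Network( anNIL )
    nRow = 1
    for i, mdB in enumerate( nn.m_amdAllBs ):
        nn.m_amdAllBs[ i ] = np.array( l_Rows[ nRow ], dtype=float ).reshape( mdB.shape )
        nRow = nRow + 1
    for i, mdW in enumerate( nn.m_amdAllWs ):
        nn.m_amdAllWs[ i ] = np.array( l_Rows[ nRow ], dtype=float ).reshape( mdW.shape )
        nRow = nRow + 1
    return nn

# Example usage (after training):
#     SaveNetworkToCsv( nn, "Network.csv" )
#     nn2 = LoadNetworkFromCsv( "Network.csv" )

A binary format such as np.savez() would also work, but the exercise specifically asks for a plain-text CSV file.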


A song I like:

[Realizing that you wouldn’t then be in a mood to listen to (and be able to hum) a neat, melodious, golden oldie, I may as well run it right away. … Yes, I will come back with my new year’s resolutions on the 31st/1st (see the post just below), but it’s likely that I won’t run any songs section at that time.]

(Hindi) “dhaDakne lage dil ke taron ki duniya…”
Singers: Mahendra Kapoor, Asha Bhosale
Music: N. Dutta
Lyrics: Sahir Ludhianvi

….Done. Take care and bye for now. See you on the 31st, with the new post…

History:
First published: 2018.12.27 23:29 IST
Updated (also the code): 2018.12.28 09.08 IST
