Data science code snippets—1: My first ANN (a copy of Nielsen’s)

I took the liberty of re-typing Nielsen’s code, in the process introducing the Hungarian naming convention for variables, and also many other non-Pythonic and non-Nielsenic whimsies. Here is the end result, for whatever it is worth. (Its sole intended purpose was to run the feed-forward and back-propagation algorithms in an IDE.) Tested only lightly on Ubuntu Release 18.04.1 LTS (Bionic Beaver) 64-bit + Python 3.7.0 64-bit in conda 4.5.11 + MS VSCode 1.30.1.

It works. (Has to. Except for the cosmetic changes, it’s basically Nielsen’s code.) It produces this graph as an output:

[Graph: Total Errors vs. Epochs]

“Enjoy!”

'''
ANN.Nielsen.SmallerAndEasiestDataVectors.py
-- Simple Artificial Neural Network.
-- This code very closely follows "network.py" given in Nielsen's book
(chapter 1). It was modified a bit by Ajit R. Jadhav. 
This code Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Meant for developing an understanding of ANNs through debugging in an IDE 
(with the ability to inspect values live at run-time, values which can also 
be verified manually, by taking *smaller* and *simpler* vectors for the input 
and target data-sets).
-- Includes a function for the generation of small and simple 1D input vectors 
and their corresponding target vectors. The input data generated here are, 
effectively, just a randomized version of the output of Nielsen's 
"vectorized_result" function.
-- Other features of this code:
* Writes the total error for each epoch, to a file.
* Plots the total error vs. epoch number. 
* Follows the Hungarian naming convention. However, the variable names 
may not reflect the best possible choices; this code was written in a hurry. 
-- Do whatever you like with it, but 100% on your own 
responsibility. 
-- But if you use this code for academic or corporate 
training purposes, then please do let me know by an 
email or a (confidential) comment here. 
Thanks in advance!
-- History
* Begun: 23 Dec. 2018 
* Completed first version that works: 26 Dec. 2018 
* This version: Sunday 30 December 2018 11:54:52  IST 
'''

import random
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Global helper functions
################################################################################

################################################################################ 
# Helper function. Uses numpy.random.rand() to create inputs and targets, 
# whether for training or for testing. Useful for very simple 1D studies only. 
# (The input data as generated here predict the targets very easily! Even just 
# a single call to the softmax function would have been sufficient!)
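# For example, GenerateData( 10, 10, 3 ) returns a list of 30 (input, target)
# tuples. Each input is a (10,1) column of small values in [0, 0.1), except 
# for the entry at the target's index, which lies in [0.8, 1.0); each target 
# is the corresponding one-hot (10,1) column.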
def GenerateData( nNeursInInput, nLenOPLayer, nCasesPerTarget ):
    # Python lists for inputs and targets, returned at the end by this function. 
    l_mdInput = []
    l_mdTarget = []
    for t in range( nLenOPLayer ):
        for i in range( nCasesPerTarget ):
            mdInput = np.random.rand( nNeursInInput, 1 )*0.1
            mdInput[t][0] = mdInput[t][0] + np.random.random_sample()*0.1 + 0.8
            l_mdInput.append( mdInput )
            
            mdTarget = np.zeros( (nLenOPLayer,1) ) 
            mdTarget[t][0] = 1.0
            l_mdTarget.append( mdTarget )
    # Nielsen's loader returns the (unpickled) data as NumPy arrays; 
    # let's simulate the same thing here
    amdInput = np.array( l_mdInput )
    amdOutput = np.array( l_mdTarget )
    # In Python 3, zip() returns an iterator, not a list. Convert it to a 
    # list here, so that the data can be shuffled and indexed later. 
    Data = list( zip( amdInput, amdOutput ) )
    return Data
   
################################################################################ 
# Helper function. Computes the sigmoid activation function
def Sigmoid( mdZ ):
    return 1.0 / ( 1.0 + np.exp( - mdZ ) )

################################################################################ 
# Helper function. Computes the derivative of the sigmoid activation function
def SigmoidDerivative( mdZ ):
    mdA = Sigmoid( mdZ )
    mdADer = mdA * ( 1.0 - mdA )
    return mdADer

################################################################################ 
# Helper function. Called with a single activation vector with its 
# target vector. Assumes that the form of the cost function for each 
# neuron is (note carefully that target comes first): 
#       C_x = 0.5 * ( dT - dA )^2 
# so that  
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# (where note that the activation comes first).
def CostDerivative( mdA_OP, mdT ):
    return ( mdA_OP - mdT ) 


################################################################################
# Class Network begins here
################################################################################

class Network(object):

    ############################################################################
    # Constructor. 
    # Supply a list containing the number of neurons in each layer, 
    # e.g., [5,4,3]  
    def __init__(self, anNIL):
        self.m_nLayers = len( anNIL )
        self.m_anNIL = np.array( anNIL )
        # anNIL[1:] means: An array with all entries except the first
        # anNIL[:-1] means: An array with all entries except for the last
        # We allocate (N,1)-shaped matrices rather than (N)-shaped vectors
        # because later on in back-propagation, to have a vectorized code,
        # we perform np.dot() in its tensor-product-like avatar, and there,
        # we need both the operands to be matrices. 
        self.m_amdAllBs = [ np.random.randn( nNeursCur, 1 ) 
                            for nNeursCur in anNIL[1:] ] 
        self.m_amdAllWs = [ np.random.randn( nNeursCur, nNeursPrev ) 
                            for nNeursPrev, nNeursCur in
                                zip( anNIL[:-1], anNIL[1:] ) ] 
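        # For example, with anNIL = [5,4,3]: m_amdAllBs holds matrices of 
        # shapes (4,1) and (3,1), and m_amdAllWs holds matrices of shapes 
        # (4,5) and (3,4), i.e., (neurons in the current layer, neurons in 
        # the previous layer), one pair per non-input layer.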
        pass

    ############################################################################
    # For each (input, target) tuple representing a row in the 
    # mini-batch, perform the training. We compute the changes (delta's) 
    # to be effected to the biases and weights through feed-forward and 
    # back-propagation, add them up until the mini-batch is over, and then
    # apply them to update the weights and biases only once at the end of 
    # this function. Unlike in Nielsen's code, here we have a separate 
    # function each for feed-forward and back-propagation
    def UpdateMiniBatch( self, dEta, nMiniBatchSize, MiniBatch ):

        # Allocate a local list each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zeros. A single call to 
        # np.zeros() can't be used directly, because these matrices are not of 
        # uniform sizes; the no. of neurons differs arbitrarily from layer to 
        # layer.
        amdAllGradBs = [ np.zeros( mdGradB.shape ) for mdGradB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdGradW.shape ) for mdGradW in self.m_amdAllWs ]
        l_mdErrorsMB = []

        # For each (input, target) tuple representing a row in the 
        # mini-batch, perform the training. 
        
        for x, y in MiniBatch:
            
            # Feed-Forward Pass: Feed the next input vector to the network 
            # using the existing weights and biases, and find the net-input 
            # and activations it produces at all the network layers.
            
            l_mdAllZs, l_mdAllAs = self.FeedForward( x ) 
            
            # Compute the error vector
            mdDiff = y - l_mdAllAs[ -1 ]
            mdErrors = 0.5 * mdDiff * mdDiff
            l_mdErrorsMB.append( mdErrors )
            
            # Back-Propagation Pass: Back-propagate the change in the total cost 
            # function (total error) to the local gradients to be applied to 
            # each weight (of a synaptic connection between current and previous
            # layers), and to each bias (of the current layer). 
            amdDeltaGradB, amdDeltaGradW = self.BackProp( l_mdAllAs, l_mdAllZs, y )
            
            # Add the changes to the local copies of the biases and weights. 
            # The zip function takes iterables as input, and returns an 
            # iterator of tuples. Thus, zipping here pairs up the running sums 
            # with the freshly computed deltas, layer by layer.
            amdAllGradBs = [ mdGradB + mdDeltaGradB 
                        for mdGradB, mdDeltaGradB in zip( amdAllGradBs, amdDeltaGradB ) ]

            amdAllGradWs = [ mdGradW + mdDeltaGradW 
                        for mdGradW, mdDeltaGradW in zip( amdAllGradWs, amdDeltaGradW ) ]
        # Processing for all the rows in the current mini-batch is now over. 
        # Now, update the network data-members from the local copies.
         
        # Note, we take an average of the changes over all rows.
        dEtaAvgOverMB = dEta / nMiniBatchSize

        # For each layer, the update is: 
        #     mdB -> mdB - (dEta/m) * \sum_{i}^{all rows of mini-batch} GradB_i
        #     mdW -> mdW - (dEta/m) * \sum_{i}^{all rows of mini-batch} GradW_i
        # where m = nMiniBatchSize.
        self.m_amdAllBs = [ mdB - dEtaAvgOverMB * mdGradB 
            for mdB, mdGradB in zip( self.m_amdAllBs, amdAllGradBs ) ]

        self.m_amdAllWs = [ mdW - dEtaAvgOverMB * mdGradW 
            for mdW, mdGradW in zip( self.m_amdAllWs, amdAllGradWs ) ]

        # Return the list of error vectors accumulated over this mini-batch
        return l_mdErrorsMB

    ############################################################################
    # Called for a single input vector (i.e., "x" in Nielsen's code)
    def FeedForward( self, mdA ):

        # Python list of all activations, layer-by-layer 
        l_mdAllAs = [ mdA ]
        # Python list for z vectors, layer-by-layer
        l_mdAllZs = []

        # For the weight and bias matrices of each layer... 
        for mdW, mdB in zip( self.m_amdAllWs, self.m_amdAllBs ):
            # Compute the net input vector to activate 
            # the neurons of this layer
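            # (With anNIL = [10,10,10], mdW is (10,10), mdA and mdB are 
            # (10,1), and so mdZ comes out as (10,1).)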
            mdZ = np.dot( mdW, mdA ) + mdB
            l_mdAllZs.append( mdZ )
            
            # Compute the activations for all the neurons
            # of this layer 
            mdA = Sigmoid( mdZ )
            l_mdAllAs.append( mdA )

        return ( l_mdAllZs, l_mdAllAs )

    ############################################################################
    # Called with the net inputs (Z's) and the activations (A's) for all the 
    # layers, i.e., for the entire network in one go, together with the 
    # target vector. 
    def BackProp( self, l_mdAllAs, l_mdAllZs, mdTarget ):

        # Allocate a local list each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zeros. A single call to 
        # np.zeros() can't be used, because these matrices are not of uniform 
        # sizes; the no. of neurons differs arbitrarily from layer to layer.
        amdAllGradBs = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Back-propagation occurs in two distinct stages:
        # (i) In the first stage, we begin from the end (the output layer), and
        # use the layer just before that, in order to update the biases and 
        # weights of the connections lying between the two (which, in this 
        # class, are stored at the index of the OP layer). 
        # (ii) In the second stage, we iterate back successively through the 
        # network, starting from the last hidden layer, up to the first hidden 
        # layer.
        # The split-up into two stages is necessary because the activation of a
        # neuron from the input- or hidden-layers flows through many (all) 
        # neurons in the immediately next layer, and the total cost function 
        # (i.e. the total error) is affected by *all* these paths. In contrast, 
        # the output layer has no next layer, and so, the activation of a 
        # neuron in the output layer affects the total cost function only 
        # through its own contribution to the total cost, not through any 
        # further layer. So, the Delta at the OP layer can be directly 
        # computed from a knowledge of the OP layer activations (predictions) 
        # and the prescribed targets. But to compute the Delta at an 
        # intermediate layer, we do need the Delta for the immediately 
        # next layer.  
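        # In symbols: if L denotes the OP layer and l any other non-input 
        # layer, with * as the element-wise product and . as the matrix 
        # product, the two stages below implement the standard relations:
        #     delta_L = ( a_L - target ) * SigmoidDerivative( z_L )
        #     delta_l = ( W_{l+1}^T . delta_{l+1} ) * SigmoidDerivative( z_l )
        #     GradB_l = delta_l
        #     GradW_l = delta_l . ( a_{l-1} )^T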

        ##########
        # Stage I: Compute mdGradB and mdGradW for the Output Layer
        
        # The OP layer is the last, accessed through the -1 index. Get the 
        # activations (A's) on its output side, and the net inputs (Z's) on its 
        # input side.
        mdA_OP = l_mdAllAs[ -1 ] 
        mdZ_OP = l_mdAllZs[ -1 ] 
        
        # Compute the partial derivatives, and use them to find the Grad for 
        # all B's in the OP layer.  
        mdZ_OP_Der = SigmoidDerivative( mdZ_OP )
        # As an intermediate product, find the delta for the OP layer. It is
        # subsequently used in the second stage as an input, thereby 
        # back-propagating the changes to the total cost function back 
        # throughout the network. 
        mdDelta = CostDerivative( mdA_OP, mdTarget ) * mdZ_OP_Der
        # Compute the Grad for all B's in the OP layer
        amdAllGradBs[ -1 ] = mdDelta
        
        # Compute the Grad for all W's of the connections between the last 
        # hidden layer and the OP layer. For this, we need the activations of 
        # the layer just before the OP layer, accessed through the -2 index.
        # We use "LHL" to mean the last hidden layer. 
        mdA_LHL_Transpose = l_mdAllAs[ -2 ].transpose()
        # Here, we build the GradW *matrix* from what essentially are two 
        # vectors (i.e., (N,1) and (M,1) matrices). Even if the numpy function 
        # says "dot", here it works like the outer (tensor) product---it 
        # outputs an (N x M) matrix, not a single scalar. It is for this step
        # that we made even B's an (N,1)-shaped matrix in the constructor. 
        amdAllGradWs[ -1 ] = np.dot( mdDelta, mdA_LHL_Transpose )
        
        ###########
        # Stage II: Compute mdGradB and mdGradW for the last hidden layer
        #           and go back, all the way up to the first hidden layer 

        # Note that only the negative of the index ever gets used in the 
        # following loop. 
        # a[-2] means the second-last layer (i.e. the last hidden layer).
        # a[-1] means the last layer (i.e., the OP layer).
        # a[-2-1] means the layer just before the second-last layer.
        # The current layer starts from the last hidden layer; the previous
        # layer is the one just before the current. 
        # If the indexing gets confusing, notice that Nielsen uses the range 
        # (2, nLayers), rather than that of (1, nLayers-1) which would have 
        # been a bit more natural, at least to me. But of course, his choice 
        # is slightly more consistent from a mathematical viewpoint---and 
        # from the fact that there are no weights and biases for the IP 
        # layer.   
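        # For example, with anNIL = [10,10,10] (i.e., self.m_nLayers = 3), 
        # this loop runs only for nL = 2: l_mdAllZs[ -2 ] is the hidden 
        # layer's Z, self.m_amdAllWs[ -1 ] holds the weights between the 
        # hidden and the OP layers, and l_mdAllAs[ -3 ] is the input layer's 
        # activation, i.e., the input vector itself.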
        for nL in range( 2, self.m_nLayers ):
            # Find the mdDelta for the current layer, using the mdDelta of 
            # the next layer...  
            mdZCur = l_mdAllZs[ -nL ]
            mdZCurDer = SigmoidDerivative( mdZCur )
            mdWNextTrans = self.m_amdAllWs[ -nL+1 ].transpose()
            # Note, for the very first pass through the loop, the mdDelta being 
            # used on the RHS is what was earlier computed in the Stage I given 
            # above (viz. for the output layer). It now gets updated here, so 
            # that it can act as the "previous" mdDelta in the next pass 
            # through this loop.
            mdDelta = np.dot( mdWNextTrans, mdDelta ) * mdZCurDer
            # mdDelta now refers to the current layer, not to the next
            amdAllGradBs[ -nL ] = mdDelta
            
            mdAPrevTrans = l_mdAllAs[ -nL-1 ].transpose()
            amdAllGradWs[ -nL ] = np.dot( mdDelta, mdAPrevTrans )

            # Explaining the above four lines is left as an exercise 
            # for the reader. (Hints: (i) Write out the matrix equation 
            # "straight," i.e., as used in the feed-forward pass. (ii) 
            # Use the rules for the matrix multiplications and the rules
            # for taking transposes.)
               
        return ( amdAllGradBs, amdAllGradWs )

    ############################################################################
    # Implements the stochastic gradient descent algorithm
    def SGD( self, sErrorsFileName, nEpochs, 
             TrainingData, nMiniBatchSize, TestData ):
        
        # This array collects the epoch-wise average of the total activation 
        # errors at the OP layer; it supplies the y-axis values for plotting.
        adErrors = np.zeros( nEpochs ) 

        # To have a small file-size, we persist the total error for the 
        # output layer only as averaged over the entire epoch (i.e., the 
        # entire training data, randomly shuffled). 
        foErrors = open( sErrorsFileName, 'w' )
        
        nTrainingCases = len( TrainingData )
       
        # An epoch is defined for the entire training data. However, each epoch
        # begins by randomly shuffling the training data. Effectively, we
        # change the initial position from which to begin descending down the
        # higher-dimensional total-cost-function surface. 
        for e in range( nEpochs ):

            # Create Mini-Batches covering the entire training data
            np.random.shuffle( TrainingData )
            MiniBatches = [ TrainingData[ k : k + nMiniBatchSize ] 
                            for k in range( 0, nTrainingCases, nMiniBatchSize ) ]
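            # For example, with 10,000 training cases and nMiniBatchSize = 10, 
            # this yields 1,000 mini-batches per epoch.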
            
            dTotalErrorEpoch = 0
            # For each mini-batch
            for MiniBatch in MiniBatches:
                # Conduct the training over the entire mini-batch, 
                # and collect the errors accumulated over it. Add them
                # to the accumulated error for the current epoch. 
                l_mdErrorsMB = self.UpdateMiniBatch( dEta, nMiniBatchSize, MiniBatch )
                dAvgErrorMB = self.ComputeAvgErrorForMiniBatch( l_mdErrorsMB )
                dTotalErrorEpoch = dTotalErrorEpoch + dAvgErrorMB 

            # Average over the number of mini-batches in this epoch
            dAvgErrorEpoch = dTotalErrorEpoch / len( MiniBatches )
            adErrors[ e ] = dAvgErrorEpoch
            # Write to file
            sLine = "%E\n" % (dAvgErrorEpoch)
            foErrors.write( sLine )

            # self.Evaluate( TestData )
            print( "Epoch %d: Avg. Error: %lf" % (e, dAvgErrorEpoch) )

        foErrors.close()
        
        adx = np.arange( len(adErrors) )
        plt.plot( adx, adErrors )
        plt.xlabel( "Epochs" )
        plt.ylabel( "Total Error at the OP Layer\n(Epoch Avg. of Mini-Batch Avgs.)" )
        plt.title( "Training Errors" )
        plt.show()

    ############################################################################
    # Helper function
    def ComputeAvgErrorForMiniBatch( self, l_mdErrorsMB ):
        nSize = len( l_mdErrorsMB )
        dTotalErrorMB = 0 
        # For each training case in this mini-batch
        for i in range( nSize ):
            # Get the error vector for the i-th training case
            mdE = l_mdErrorsMB[ i ]
            dTotalErrorCase = mdE.sum()
            dTotalErrorMB = dTotalErrorMB + dTotalErrorCase
        # The average of the total errors for all cases in the current mini-batch
        dAvgErrorMB = dTotalErrorMB / nSize
        return dAvgErrorMB

    ############################################################################
    # This function is mostly useless in as simple a scenario as this; it 
    # predicts 100 % accuracy right from the 1st epoch! Obvious. With the kind of input 
    # data we generate, the prediction would have been 100 % accurate even
    # with a single function call to softmax! 
    #   
    # def Evaluate( self, TestData ):
    #     nTestCases = len( TestData )
    #     anPredictedRes = []
    #     for x, y in TestData:
    #         l_mdAllZs, l_mdAllAs = self.FeedForward( x )
            
    #         mdActivOP = l_mdAllAs[ -1 ] 
    #         nPredIdx = np.argmax( mdActivOP )
    #         nTargetIdx = np.argmax( y )
    #         anPredictedRes.append( int( nPredIdx == nTargetIdx ) )
    #     print( anPredictedRes )
    #     pass
################################################################################
# Class Network ends here
################################################################################


################################################################################
# main script
################################################################################

# Prepare Input Data

# The learning rate
dEta = 0.5 

# Number of Neurons contained In each successive Layer
anNIL = [10,10,10]

# Size of the input layer
nNeursInInput = anNIL[ 0 ]
# Size of the target layer
nTargets = anNIL[ -1 ]
# Total number of cases to have for training
nTotalNumCasesForTraining = 10000
# Total number of cases to have for testing
nTotalNumCasesForTesting = nTotalNumCasesForTraining // 5

# Number of cases to generate for each target for training
nCasesPerTargetForTraining = nTotalNumCasesForTraining // nTargets
# Number of cases to generate for each target for testing
nCasesPerTargetForTesting = nTotalNumCasesForTesting // nTargets


# For repeatability, seed the RNG in NumPy. For "true" random-ness
# at each program invocation, comment out the next line. Notice that 
# seeding, if any, must occur before generating data.   
np.random.seed( 0 )

TrainingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTraining )
TestingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTesting )

# Number of Epochs
nEpochs = 10
# Mini-Batch Size
nMiniBatchSize = 10

# Instantiate the network
# Optionally, we can have a re-seeding also here
# np.random.seed( 10000 )
nn = Network( anNIL )
# Let the network find the optimum combination of values 
# for all the weights and biases it contains.
nn.SGD( "EpochErrors.txt", nEpochs, TrainingData, nMiniBatchSize, TestingData )

print( "Done!" )


Things for you to add and/or to try:

  • Add the code to persist the network (weights and biases) to a plain-text CSV file, and to re-initialize it from that file. (A minimal sketch is given right after this list.)
  • Add the code to show a simple (real-time) diagrammatic visualization of the training process for a tiny (say [4,3,2]) network, using colormap-based shades for the lines representing the synaptic weights, and similarly colormap-based circles (for the neurons’ backgrounds) representing the biases.
  • Write a more ambitious code for the generation of n-dimensional input data. It would have `m’ islands of volumes (of the simplest topology) of randomly varying dark shades, embedded in a randomly varying background of light shades. That is, it would be an n-dimensional generalization of taking the Windows Paint (or a similar) program, filling the canvas with a nonuniform but light-gray color, and randomly placing dark-shaded blobs of arbitrary shapes in it, sort of like a two-phase material’s micro-structure. You would then take a configuration of such blobs and introduce small but random variations into its shape and position too. Then, take this small group of variations and assign a target value to it. Repeat for N number of targets.
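For the first item, here is a minimal sketch, assuming it is appended to the script above (so that np and the class Network are already in scope); the helper names SaveNetworkToCsv and LoadNetworkFromCsv are my own choices, not anything from Nielsen’s code:

################################################################################ 
# Persist the network (layer sizes, weights and biases) to a plain-text CSV 
# file, and re-initialize an equivalent network from that file.
def SaveNetworkToCsv( nn, sFileName ):
    with open( sFileName, 'w' ) as foNet:
        # First line: the layer sizes, e.g. "10,10,10"
        foNet.write( ",".join( str( n ) for n in nn.m_anNIL ) + "\n" )
        # Then, for each non-input layer: one line with the flattened 
        # weights matrix, followed by one line with the flattened biases
        for mdW, mdB in zip( nn.m_amdAllWs, nn.m_amdAllBs ):
            foNet.write( ",".join( repr( float( d ) ) for d in mdW.flatten() ) + "\n" )
            foNet.write( ",".join( repr( float( d ) ) for d in mdB.flatten() ) + "\n" )

def LoadNetworkFromCsv( sFileName ):
    with open( sFileName, 'r' ) as fiNet:
        anNIL = [ int( s ) for s in fiNet.readline().split( "," ) ]
        # Construct a network of the right topology (with random weights 
        # and biases), and then overwrite them with the persisted values
        nn = Network( anNIL )
        for i, ( nPrev, nCur ) in enumerate( zip( anNIL[ :-1 ], anNIL[ 1: ] ) ):
            adW = np.array( [ float( s ) for s in fiNet.readline().split( "," ) ] )
            nn.m_amdAllWs[ i ] = adW.reshape( nCur, nPrev )
            adB = np.array( [ float( s ) for s in fiNet.readline().split( "," ) ] )
            nn.m_amdAllBs[ i ] = adB.reshape( nCur, 1 )
    return nn

# Example usage:
# SaveNetworkToCsv( nn, "Network.csv" )
# nn2 = LoadNetworkFromCsv( "Network.csv" )

Writing the floats via repr( float( ... ) ) keeps the file human-readable while still preserving enough digits for the values to round-trip exactly.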

 


No songs section this time. Will be back on 31st/1st, as promised (see the post just below).

A song I like:

[Realizing that you wouldn’t then be in a mood to listen to (and be able to hum) a neat, melodious, golden oldie, I may as well run it right away. … Yes, I will come back with my new year’s resolutions on the 31st/1st (see the post just below), but it’s likely that I won’t run any songs section at that time.]

(Hindi) “dhaDakne lage dil ke taron ki duniya…”
Singers: Mahendra Kapoor, Asha Bhosale
Music: N. Dutta
Lyrics: Sahir Ludhianvi

….Done. Take care and bye for now. See you on the 31st, with the new post…


History:
First published: 2018.12.27 23:29 IST
Updated (also the code): 2018.12.28 09.08 IST

 
