I took the liberty of re-typing Nielsen’s[^] code[^], in the process introducing the Hungarian naming convention for the variables, and also many other non-Pythonic and non-Nielsenic whimsies. Here is the end-result, for whatever it is worth. (Its sole intended purpose was to run the feed-forward and back-propagation algorithms in an IDE.) Tested only lightly, on Ubuntu 18.04.1 LTS (Bionic Beaver) 64-bit + Python 3.7.0 64-bit in conda 4.5.11 + MS VSCode 1.30.1.

It works. (It has to; except for the cosmetic changes, it’s basically Nielsen’s code.) It produces this graph as output:

“Enjoy!”

```python
'''
ANN.Nielsen.SmallerAndEasiestDataVectors.py

-- Simple Artificial Neural Network.

-- This code very closely follows "network.py" given in Nielsen's book
   (chapter 1). It was modified a bit by Ajit R. Jadhav.
   This code Copyright (c) Ajit R. Jadhav. All rights reserved.

-- Meant for developing an understanding of ANNs through debugging in an
   IDE (with the ability to inspect values live at run-time, values which
   can also be verified manually, by taking *smaller* and *simpler*
   vectors for the input and target data-sets).

-- Includes functions for the generation of small and simple 1D input
   vectors and their corresponding target vectors. The input data
   generated here is, effectively, just a randomized version of
   Nielsen's "vectorized_result" function.

-- Other features of this code:
   * Writes the total error for each epoch to a file.
   * Plots the total error vs. the epoch number.
   * Follows the Hungarian naming convention. However, the variable
     names may not reflect the best possible choices; this code was
     written in a hurry.

-- Do whatever you like with it, but 100% on your own responsibility.

-- But if you use this code for academic or corporate training purposes,
   then please do let me know by an email or a (confidential) comment
   here. Thanks in advance!

-- History
   * Begun: 23 Dec. 2018
   * Completed first version that works: 26 Dec. 2018
   * This version: Sunday 30 December 2018 11:54:52 IST
'''

import numpy as np
import matplotlib.pyplot as plt


################################################################################
# Global helper functions
################################################################################

################################################################################
# Helper function. Uses numpy.random.rand() to create inputs and targets,
# whether for training or for testing. Useful for very simple 1D studies
# only. (The input data as generated here very easily predict the targets!
# Even just a single call to the softmax function would have been
# sufficient!!)

def GenerateData( nNeursInInput, nLenOPLayer, nCasesPerTarget ):
    # Python lists for inputs and targets, returned at the end by this
    # function.
    l_mdInput = []
    l_mdTarget = []
    for t in range( nLenOPLayer ):
        for i in range( nCasesPerTarget ):
            mdInput = np.random.rand( nNeursInInput, 1 )*0.1
            mdInput[t][0] = mdInput[t][0] + np.random.random_sample()*0.1 + 0.8
            l_mdInput.append( mdInput )

            mdTarget = np.zeros( (nLenOPLayer, 1) )
            mdTarget[t][0] = 1.0
            l_mdTarget.append( mdTarget )

    # pickle returns data as arrays; let's simulate the same thing here.
    amdInput = np.array( l_mdInput )
    amdOutput = np.array( l_mdTarget )
    # A Python 3 thing: convert the zip to a list either here or later.
    # Better to do it here.
    Data = list( zip( amdInput, amdOutput ) )
    return Data


################################################################################
# Helper function. Computes the sigmoid activation function.

def Sigmoid( mdZ ):
    return 1.0 / ( 1.0 + np.exp( -mdZ ) )


################################################################################
# Helper function. Computes the derivative of the sigmoid activation
# function.

def SigmoidDerivative( mdZ ):
    mdA = Sigmoid( mdZ )
    mdADer = mdA * ( 1.0 - mdA )
    return mdADer


################################################################################
# Helper function. Called with a single activation vector and its target
# vector. Assumes that the form of the cost function for each neuron is
# (note carefully that the target comes first):
#     C_x = 0.5 * ( dT - dA )^2
# so that
#     \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT )
# (where note that the activation comes first).

def CostDerivative( mdA_OP, mdT ):
    return ( mdA_OP - mdT )


################################################################################
# Class Network begins here
################################################################################

class Network(object):

    ############################################################################
    # Constructor. Supply an array containing the number of neurons in
    # each layer, e.g., [5,4,3].
    def __init__(self, anNIL):
        self.m_nLayers = len( anNIL )
        self.m_anNIL = np.array( anNIL )
        # anNIL[1:] means: an array with all the entries except the first.
        # anNIL[:-1] means: an array with all the entries except the last.
        # We allocate (N,1)-shaped matrices rather than (N,)-shaped
        # vectors because later on in back-propagation, to have a
        # vectorized code, we perform np.dot() in its tensor-product-like
        # avatar, and there, we need both the operands to be matrices.
        self.m_amdAllBs = [ np.random.randn( nNeursCur, 1 )
                            for nNeursCur in anNIL[1:] ]
        self.m_amdAllWs = [ np.random.randn( nNeursCur, nNeursPrev )
                            for nNeursPrev, nNeursCur
                            in zip( anNIL[:-1], anNIL[1:] ) ]

    ############################################################################
    # For each (input, target) tuple representing a row in the mini-batch,
    # perform the training. We compute the changes (delta's) to be
    # effected to the biases and weights through feed-forward and
    # back-propagation, add them up until the mini-batch is over, and
    # only then update the weights and biases, once, at the end of this
    # function. Unlike in Nielsen's code, here we have a separate
    # function each for feed-forward and back-propagation.
    def UpdateMiniBatch( self, dEta, nMiniBatchSize, MiniBatch ):
        # Allocate a local matrix each for holding the GradB and GradW
        # matrices for all the layers, initialized to zeros. We can't
        # directly use a single np.zeros() because these matrices are
        # not of uniform sizes; the no. of neurons in each layer is
        # arbitrarily different.
        amdAllGradBs = [ np.zeros( mdGradB.shape )
                         for mdGradB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdGradW.shape )
                         for mdGradW in self.m_amdAllWs ]

        l_mdErrorsMB = []
        # For each (input, target) tuple representing a row in the
        # mini-batch, perform the training.
        for x, y in MiniBatch:
            # Feed-Forward Pass: Feed the next input vector to the
            # network using the existing weights and biases, and find
            # the net-inputs and the activations it produces at all the
            # network layers.
            l_mdAllZs, l_mdAllAs = self.FeedForward( x )

            # Compute the error vector.
            mdDiff = y - l_mdAllAs[ -1 ]
            mdErrors = 0.5 * mdDiff * mdDiff
            l_mdErrorsMB.append( mdErrors )

            # Back-Propagation Pass: Back-propagate the change in the
            # total cost function (the total error) to the local
            # gradients to be applied to each weight (of a synaptic
            # connection between the current and the previous layers),
            # and to each bias (of the current layer).
            amdDeltaGradB, amdDeltaGradW = self.BackProp( l_mdAllAs,
                                                          l_mdAllZs, y )

            # Add the changes to the local copies of the gradients.
            # Zipping here makes each comprehension iterate through the
            # layers of the network.
            amdAllGradBs = [ mdGradB + mdDeltaGradB
                             for mdGradB, mdDeltaGradB
                             in zip( amdAllGradBs, amdDeltaGradB ) ]
            amdAllGradWs = [ mdGradW + mdDeltaGradW
                             for mdGradW, mdDeltaGradW
                             in zip( amdAllGradWs, amdDeltaGradW ) ]

        # Processing for all the rows in the current mini-batch is now
        # over. Update the network data-members from the local copies.
        # Note, we take an average of the changes over all the rows.
        dEtaAvgOverMB = dEta / nMiniBatchSize
        self.m_amdAllBs = [ mdB - dEtaAvgOverMB * mdGradB
                            for mdB, mdGradB
                            in zip( self.m_amdAllBs, amdAllGradBs ) ]
        self.m_amdAllWs = [ mdW - dEtaAvgOverMB * mdGradW
                            for mdW, mdGradW
                            in zip( self.m_amdAllWs, amdAllGradWs ) ]

        # Return the list of error vectors for this mini-batch.
        return l_mdErrorsMB

    ############################################################################
    # Called for a single input vector (i.e., "x" in Nielsen's code).
    def FeedForward( self, mdA ):
        # Python list of all the activations, layer-by-layer.
        l_mdAllAs = [ mdA ]
        # Python list of the z vectors, layer-by-layer.
        l_mdAllZs = []
        # For the weight and bias matrices of each layer...
        for mdW, mdB in zip( self.m_amdAllWs, self.m_amdAllBs ):
            # Compute the net-input vector that activates the neurons
            # of this layer.
            mdZ = np.dot( mdW, mdA ) + mdB
            l_mdAllZs.append( mdZ )
            # Compute the activations of all the neurons of this layer.
            mdA = Sigmoid( mdZ )
            l_mdAllAs.append( mdA )
        return ( l_mdAllZs, l_mdAllAs )

    ############################################################################
    # Called with the net-inputs and the activations for all the layers,
    # i.e., for the entire network in one go.
    def BackProp( self, l_mdAllAs, l_mdAllZs, mdTarget ):
        # Allocate local, zero-initialized Grad matrices, as in
        # UpdateMiniBatch().
        amdAllGradBs = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Back-propagation occurs in two distinct stages:
        # (i) In the first stage, we begin from the end (the output
        # layer), and use the layer just before it, in order to update
        # the biases and the weights of the connections lying between
        # the two (which, in this class, are stored with the OP layer).
        # (ii) In the second stage, we iterate back successively through
        # the network, starting from the last hidden layer, up to the
        # first hidden layer.
        #
        # The split-up into two stages is necessary because the
        # activation of a neuron in the input- or hidden-layers flows
        # through many (all) neurons of the immediately next layer, and
        # the total cost function (the total error) is affected by *all*
        # these paths. In contrast, the output layer has no next layer,
        # and so the activation of a neuron in the output layer affects
        # the total cost function only through its own connection to it.
        # So, the delta at the OP layer can be computed directly from a
        # knowledge of the OP-layer activations (predictions) and the
        # prescribed targets. But to compute the delta at an
        # intermediate layer, we do need the delta of the immediately
        # next layer.

        ##########
        # Stage I: Compute mdGradB and mdGradW for the output layer.

        # The OP layer is the last, accessed through the -1 index. Get
        # the activations (A's) on its output side, and the net-inputs
        # (Z's) on its input side.
        mdA_OP = l_mdAllAs[ -1 ]
        mdZ_OP = l_mdAllZs[ -1 ]
        mdZ_OP_Der = SigmoidDerivative( mdZ_OP )

        # As an intermediate product, find the delta for the OP layer.
        # It is subsequently used in the second stage as an input,
        # thereby back-propagating the changes to the total cost
        # function back throughout the network.
        mdDelta = CostDerivative( mdA_OP, mdTarget ) * mdZ_OP_Der

        # Compute the Grad for all the B's in the OP layer.
        amdAllGradBs[ -1 ] = mdDelta

        # Find the Grad for all the W's lying between the OP layer and
        # the last hidden layer ("LHL"); index -2 means: just before OP.
        mdA_LHL_Transpose = l_mdAllAs[ -2 ].transpose()
        # Here, we build the GradW *matrix* from what essentially are
        # two vectors (i.e., (N,1) matrices). Even if the numpy function
        # says "dot", it actually works like the tensor product---it
        # outputs an (N x N) matrix, not a single scalar. It is for this
        # step that we made even the B's (N,1)-shaped matrices in the
        # constructor.
        amdAllGradWs[ -1 ] = np.dot( mdDelta, mdA_LHL_Transpose )

        ###########
        # Stage II: Compute mdGradB and mdGradW for the last hidden
        # layer, and go back, all the way up to the first hidden layer.

        # Note that only the negative of the index ever gets used in the
        # following loop:
        #     a[-2] is the second-last layer (the last hidden layer);
        #     a[-1] is the last layer (the OP layer);
        #     a[-2-1] is the layer just before the second-last layer.
        # If the indexing gets confusing, notice that Nielsen uses the
        # range (2, nLayers), rather than the (1, nLayers-1) which would
        # have been a bit more natural, at least to me. But of course,
        # his choice is slightly more consistent from a mathematical
        # viewpoint---and from the fact that there are no weights and
        # biases for the IP layer.
        for nL in range( 2, self.m_nLayers ):
            # Find the mdDelta for the current layer, using that of the
            # layer next to it.
            mdZCur = l_mdAllZs[ -nL ]
            mdZCurDer = SigmoidDerivative( mdZCur )
            mdWNextTrans = self.m_amdAllWs[ -nL+1 ].transpose()
            # Note, for the very first pass through this loop, the
            # mdDelta being used on the RHS is the one computed in
            # Stage I above (viz. for the output layer). It gets updated
            # here, so that it can act as the "next-layer" mdDelta in
            # the next pass through this loop.
            mdDelta = np.dot( mdWNextTrans, mdDelta ) * mdZCurDer
            # mdDelta now refers to the current layer, not to the next.
            amdAllGradBs[ -nL ] = mdDelta
            mdAPrevTrans = l_mdAllAs[ -nL-1 ].transpose()
            amdAllGradWs[ -nL ] = np.dot( mdDelta, mdAPrevTrans )
            # Explaining the above four lines is left as an exercise for
            # the reader. (Hints: (i) Write out the matrix equation
            # "straight," i.e., as used in the feed-forward pass.
            # (ii) Use the rules for matrix multiplication and for
            # taking transposes.)

        return ( amdAllGradBs, amdAllGradWs )

    ############################################################################
    # Implements the stochastic gradient-descent algorithm.
    def SGD( self, sErrorsFileName, nEpochs, dEta, TrainingData,
             nMiniBatchSize, TestData ):
        # Used only while plotting the total activation errors at the
        # OP layer.
        adErrors = np.zeros( nEpochs )
        # To have a small file-size, we persist the total error at the
        # output layer only as averaged over the entire epoch (i.e., the
        # entire training data, randomly shuffled).
        foErrors = open( sErrorsFileName, 'w' )
        nTrainingCases = len( TrainingData )

        # An epoch is defined over the entire training data. However,
        # each epoch begins by randomly shuffling the training data.
        # Effectively, we change the initial position from which to
        # begin descending down the higher-dimensional
        # total-cost-function surface.
        for e in range( nEpochs ):
            # Create mini-batches covering the entire training data.
            np.random.shuffle( TrainingData )
            MiniBatches = [ TrainingData[ k : k + nMiniBatchSize ]
                            for k in range( 0, nTrainingCases,
                                            nMiniBatchSize ) ]
            dTotalErrorEpoch = 0
            # For each mini-batch...
            for MiniBatch in MiniBatches:
                # Conduct the training over the entire mini-batch, and
                # collect the errors accumulated over it. Add them to
                # the accumulated error for the current epoch.
                l_mdErrorsMB = self.UpdateMiniBatch( dEta, nMiniBatchSize,
                                                     MiniBatch )
                dAvgErrorMB = self.ComputeAvgErrorForMiniBatch( l_mdErrorsMB )
                dTotalErrorEpoch = dTotalErrorEpoch + dAvgErrorMB

            # The epoch average of the mini-batch averages; note the
            # division by the no. of mini-batches.
            dAvgErrorEpoch = dTotalErrorEpoch / len( MiniBatches )
            adErrors[ e ] = dAvgErrorEpoch

            # Write to the file.
            sLine = "%E\n" % (dAvgErrorEpoch)
            foErrors.write( sLine )

            # self.Evaluate( TestData )
            print( "Epoch %d: Avg. Error: %lf" % (e, dAvgErrorEpoch) )

        foErrors.close()

        adx = np.arange( len(adErrors) )
        plt.plot( adx, adErrors )
        plt.xlabel( "Epochs" )
        plt.ylabel( "Total Error at the OP Layer\n"
                    "(Epoch Avg. of Mini-Batch Avgs.)" )
        plt.title( "Training Errors" )
        plt.show()

    ############################################################################
    # Helper function.
    def ComputeAvgErrorForMiniBatch( self, l_mdErrorsMB ):
        nSize = len( l_mdErrorsMB )
        dTotalErrorMB = 0
        # For each training case in this mini-batch...
        for i in range( nSize ):
            # Get the error vector for the i-th training case.
            mdE = l_mdErrorsMB[ i ]
            dTotalErrorCase = mdE.sum()
            dTotalErrorMB = dTotalErrorMB + dTotalErrorCase
        # The average of the total errors of all the cases in the
        # current mini-batch.
        dAvgErrorMB = dTotalErrorMB / nSize
        return dAvgErrorMB

    ############################################################################
    # This function is mostly useless in as simple a scenario as this;
    # it predicts 100% accuracy right from the 1st epoch! Obvious. With
    # the kind of input data we generate, the prediction would have been
    # 100% accurate even with a single function call to softmax!
    #
    # def Evaluate( self, TestData ):
    #     nTestCases = len( TestData )
    #     anPredictedRes = []
    #     for x, y in TestData:
    #         l_mdAllZs, l_mdAllAs = self.FeedForward( x )
    #         mdActivOP = l_mdAllAs[ -1 ]
    #         nPredIdx = np.argmax( mdActivOP )
    #         nTargetIdx = np.argmax( y )
    #         anPredictedRes.append( int( nPredIdx == nTargetIdx ) )
    #     print( anPredictedRes )

################################################################################
# Class Network ends here
################################################################################


################################################################################
# main script
################################################################################

# Prepare the input data.

# The learning rate.
dEta = 0.5
# The Number of neurons contained In each successive Layer.
anNIL = [10, 10, 10]
# Size of the input layer.
nNeursInInput = anNIL[ 0 ]
# Size of the target layer.
nTargets = anNIL[ -1 ]
# Total number of cases to have for training.
nTotalNumCasesForTraining = 10000
# Total number of cases to have for testing.
nTotalNumCasesForTesting = nTotalNumCasesForTraining // 5
# Number of cases to generate for each target, for training.
nCasesPerTargetForTraining = nTotalNumCasesForTraining // nTargets
# Number of cases to generate for each target, for testing.
nCasesPerTargetForTesting = nTotalNumCasesForTesting // nTargets

# For repeatability, seed the RNG in NumPy. For "true" random-ness at
# each program invocation, comment out the next line. Notice that the
# seeding, if any, must occur before generating the data.
np.random.seed( 0 )

TrainingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ],
                             nCasesPerTargetForTraining )
TestingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ],
                            nCasesPerTargetForTesting )

# Number of epochs.
nEpochs = 10
# Mini-batch size.
nMiniBatchSize = 10

# Instantiate the network. Optionally, we can re-seed also here:
# np.random.seed( 10000 )
nn = Network( anNIL )

# Let the network find the optimum combination of values for all the
# weights and biases it contains.
nn.SGD( "EpochErrors.txt", nEpochs, dEta, TrainingData, nMiniBatchSize,
        TestingData )
print( "Done!" )
```
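[An aside from the editor of this re-typed version: if you want to convince yourself that the two-stage back-propagation scheme in `BackProp` computes the right gradients, a finite-difference check is easy to write. The sketch below re-implements the same delta recurrences, standalone, for a tiny [2,3,2] network, and compares one analytic weight-gradient against a central-difference estimate of the same cost function; all the names here are mine, not part of the code above.]

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def loss(ws, bs, x, t):
    # Forward pass through sigmoid layers; cost C = 0.5 * || t - a ||^2,
    # matching the cost assumed by CostDerivative() above.
    a = x
    for w, b in zip(ws, bs):
        a = sigmoid(np.dot(w, a) + b)
    d = t - a
    return 0.5 * float(np.sum(d * d))

def backprop_grads(ws, bs, x, t):
    # Same two-stage scheme as in the article's BackProp.
    zs, activations = [], [x]
    a = x
    for w, b in zip(ws, bs):
        z = np.dot(w, a) + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Stage I: delta at the output layer.
    delta = (activations[-1] - t) * sigmoid_prime(zs[-1])
    grad_bs = [None] * len(bs)
    grad_ws = [None] * len(ws)
    grad_bs[-1] = delta
    grad_ws[-1] = np.dot(delta, activations[-2].T)
    # Stage II: iterate back through the hidden layers.
    for l in range(2, len(ws) + 1):
        delta = np.dot(ws[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        grad_bs[-l] = delta
        grad_ws[-l] = np.dot(delta, activations[-l - 1].T)
    return grad_bs, grad_ws

# A tiny [2,3,2] network with fixed random weights.
rng = np.random.default_rng(0)
sizes = [2, 3, 2]
ws = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((n, 1)) for n in sizes[1:]]
x = rng.standard_normal((2, 1))
t = np.array([[1.0], [0.0]])

grad_bs, grad_ws = backprop_grads(ws, bs, x, t)

# Central-difference estimate of dC/dw for one chosen weight.
eps = 1e-6
w_idx, (i, j) = 0, (1, 0)
ws[w_idx][i, j] += eps
lp = loss(ws, bs, x, t)
ws[w_idx][i, j] -= 2 * eps
lm = loss(ws, bs, x, t)
ws[w_idx][i, j] += eps
numerical = (lp - lm) / (2 * eps)
print(abs(numerical - grad_ws[w_idx][i, j]) < 1e-6)  # True
```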

**Things for you to add and/or to try:**

- Add the code to persist the network (weights and biases) to a plain-text CSV file, and to re-init it from that file.
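For the first exercise, one possible starting point (a sketch under my own hypothetical file-naming scheme of one CSV file per weight/bias matrix; `save_network`/`load_network` and the `prefix` are my inventions, not part of the code above):

```python
import os
import tempfile
import numpy as np

def save_network(prefix, ws, bs):
    # One plain-text CSV file per weight matrix and per bias vector.
    for i, (w, b) in enumerate(zip(ws, bs)):
        np.savetxt("%s_w%d.csv" % (prefix, i), w, delimiter=",")
        np.savetxt("%s_b%d.csv" % (prefix, i), b, delimiter=",")

def load_network(prefix, n_layers):
    # ndmin=2 keeps the (N,1) bias shape the network expects.
    ws, bs = [], []
    for i in range(n_layers - 1):
        ws.append(np.loadtxt("%s_w%d.csv" % (prefix, i), delimiter=",", ndmin=2))
        bs.append(np.loadtxt("%s_b%d.csv" % (prefix, i), delimiter=",", ndmin=2))
    return ws, bs

# Round-trip a random [5,4,3]-style set of weights and biases.
prefix = os.path.join(tempfile.gettempdir(), "ann_demo")
ws = [np.random.randn(4, 5), np.random.randn(3, 4)]
bs = [np.random.randn(4, 1), np.random.randn(3, 1)]
save_network(prefix, ws, bs)
ws2, bs2 = load_network(prefix, 3)
print(all(np.allclose(a, b) for a, b in zip(ws, ws2)))  # True
```

To wire this into the class above, the load step would assign the returned lists to `m_amdAllWs` and `m_amdAllBs`.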
- Add the code to show a simple (real-time) diagrammatic visualization of the training process for a tiny (say [4,3,2]) network, using colormap-based shades for the lines representing the synaptic weights, and similarly colormap-shaded circles for the neuron backgrounds representing the biases.
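For the second exercise, a static (not yet real-time) sketch of the drawing part, assuming a diverging colormap with weight/bias values normalized into [0, 1]; `draw_network` and its layout (layer index on the x-axis, neuron index on the y-axis) are my own choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view interactively
import matplotlib.pyplot as plt

def draw_network(ax, sizes, ws, bs, cmap=plt.cm.coolwarm):
    # Lines between layers shaded by the synaptic weight value;
    # neuron circles shaded by the bias value.
    for il in range(len(sizes) - 1):
        w = ws[il]
        wmax = np.abs(w).max() or 1.0
        for j in range(sizes[il]):            # neuron j in layer il
            for i in range(sizes[il + 1]):    # neuron i in layer il+1
                c = cmap(0.5 + 0.5 * w[i, j] / wmax)
                ax.plot([il, il + 1], [j, i], color=c, zorder=1)
    for il, n in enumerate(sizes):
        if il == 0:
            colors = ["lightgray"] * n       # the IP layer has no biases
        else:
            b = bs[il - 1]
            bmax = np.abs(b).max() or 1.0
            colors = [cmap(0.5 + 0.5 * float(b[i, 0]) / bmax)
                      for i in range(n)]
        ax.scatter([il] * n, range(n), s=400, c=colors,
                   edgecolors="k", zorder=2)
    ax.axis("off")

# A tiny [4,3,2] network with random weights and biases.
sizes = [4, 3, 2]
rng = np.random.default_rng(1)
ws = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((n, 1)) for n in sizes[1:]]
fig, ax = plt.subplots()
draw_network(ax, sizes, ws, bs)
print(len(ax.lines))  # 4*3 + 3*2 = 18 connection lines
```

Making it "real-time" would amount to calling `draw_network` after each `UpdateMiniBatch` and redrawing the figure (e.g., via `plt.pause`).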
- Write a more ambitious code for the generation of n-dimensional input data. It would have `m` islands of volumes (of the simplest topology) of randomly varying dark shades, embedded in a randomly varying background of light shades. That is, it would be an n-dimensional generalization of taking a Windows Paint (or similar) program, filling the canvas with a nonuniform but light-gray color, and randomly placing dark-shaded blobs of arbitrary shapes in it, somewhat like a two-phase material's micro-structure. You then take a configuration of such blobs and introduce random but small variations into its shape and position too. Then, take this small group of variations and assign a target value to it. Repeat for N targets.
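A deliberately minimal 1D version of the third exercise, to fix ideas: each class here is a single dark Gaussian "island" at a class-specific position, jittered per case, embedded in a light noisy background. (The arbitrary-shaped, multi-island, truly n-dimensional version is the actual exercise; `generate_blob_data` and all its parameters are my own assumptions.)

```python
import numpy as np

def generate_blob_data(n_dim, n_classes, n_cases_per_class, rng=None):
    # Each class = one dark Gaussian island at a fixed, randomly chosen
    # center (in index space), sitting on a light, noisy background.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = rng.uniform(0.2, 0.8, size=n_classes) * n_dim
    idx = np.arange(n_dim)
    width = n_dim / 10.0
    data = []
    for c in range(n_classes):
        target = np.zeros((n_classes, 1))
        target[c, 0] = 1.0
        for _ in range(n_cases_per_class):
            background = rng.uniform(0.0, 0.2, size=n_dim)  # light shades
            center = centers[c] + rng.normal(0.0, 0.5)      # positional jitter
            island = 0.8 * np.exp(-((idx - center) ** 2) / (2 * width ** 2))
            x = (background + island).reshape(n_dim, 1)
            data.append((x, target))
    return data

data = generate_blob_data(n_dim=20, n_classes=3, n_cases_per_class=5)
print(len(data))  # 3 classes * 5 cases = 15
```

The returned list of `(input, target)` tuples has the same shape-convention as `GenerateData` above, so it can be fed to the same `SGD` loop.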

~~No songs section this time. Will be back on 31st/1st, as promised (see the post just below).~~

**A song I like:**

[Realizing that you wouldn’t then be in a mood to listen to (and be able to hum) a neat, melodious, golden oldie, I may as well run it right away. … Yes, I will come back with my new year’s resolutions on the 31st/1st (see the post just below), but it’s likely that I won’t run any songs section at *that* time.]

(Hindi) “dhaDakne lage dil ke taron ki duniya…”

Singers: Mahendra Kapoor, Asha Bhosale

Music: N. Dutta

Lyrics: Sahir Ludhianvi

….Done. Take care and bye for now. See you on the 31st, with the new post…

**History:**

First published: 2018.12.27 23:29 IST

Updated (also the code): 2018.12.28 09:08 IST