Learnability of machine learning is provably an undecidable?—part 3: closure

Update on 23 January 2019, 17:55 IST:

In this series of posts, which was just a step further from the initial, brain-storming kind of a stage, I had come to the conclusion that based on certain epistemological (and metaphysical) considerations, Ben-David et al.’s conclusion (that learnability can be an undecidable) is logically untenable.

However, now, as explained here [^], I find that this particular conclusion which I drew, was erroneous. I now stand corrected, i.e., I now consider Ben-David et al.’s result to be plausible. Obviously, it merits a further, deeper, study.

However, even while acknowledging the above-mentioned mistake, let me also hasten to clarify that I still stick to my other positions, especially the central theme of this series of posts. The central theme here was that there are certain core features of the set theory which make implications such as Godel’s incompleteness theorems possible. These features (of the set theory) demonstrably carry a glaring epistemological flaw, such that applying Godel’s theorem outside of its narrow technical scope in mathematics or computer science is not permissible. In particular, Godel’s incompleteness theorem does not apply to knowledge or its validation in the more general sense of these terms. This theme, I believe, continues to hold as is.

Update over.


Gosh! I gotta get this series out of my hand—and also head! ASAP, really!! … So, I am going to scrap the bits and pieces I had written for it earlier; they would have turned this series into a 4- or 5-part one. Instead, I am going to start entirely afresh, and I am going to approach this topic from an entirely different angle—a somewhat indirect but a faster route, sort of like a short-cut. Let’s get going.


Statements:

Open any article, research paper, book or a post, and what do you find? Basically, all of these consist of sentence after sentence. That is, a series of statements, in a way. That’s all. So, let’s get going at the level of statements, from a “logical” (i.e. logic-theoretical) point of view.

Statements are made to propose or to identify (or at least to assert) some (or the other) fact(s) of reality. That’s what their purpose is.


The conceptual-level consciousness as being prone to making errors:

Coming to the consciousness of man, there are broadly two levels of cognition at which it operates: the sensory-perceptual, and the conceptual.

Examples of the sensory-perceptual level consciousness would consist of reaching a mental grasp of such facts of reality as: “This object exists, here and now;” “this object has this property, to this much degree, in reality,” etc. Notice that what we have done here is to take items of perception, and put them into the form of propositions.

Propositions can be true or false. However, at the perceptual level, a consciousness has no choice in regard to the truth-status. If the item is perceived, that’s it! It’s “true” anyway. Rather, perceptions are not subject to a test of truth or falsehood; they are, at the very base, the standards by which truth or falsehood gets decided.

A consciousness—better still, an organism—does have some choice, even at the perceptual level. The choice which it has exists in regard to such things as: what aspect of reality to focus on, with what degree of focus, with what end (or purpose), etc. But we are not talking about such things here. What matters to us here is just the truth-status, that’s all. Thus, keeping only the truth-status in mind, we can say that this very idea itself (of a truth-status) is inapplicable at the purely perceptual level. However, it is very much relevant at the conceptual level. The reason is that at the conceptual level, the consciousness is prone to err.

The conceptual level of consciousness may be said to involve two different abilities:

  • First, the ability to conceive of (i.e. create) the mental units that are the concepts.
  • Second, the ability to connect together the various existing concepts to create propositions which express different aspects of the truths pertaining to them.

It is possible for a consciousness to go wrong in either of the two respects. However, mistakes are much easier to make when it comes to the second respect.

Homework 1: Supply an example of going wrong in the first way, i.e., right at the stage of forming concepts. (Hint: Take a concept that is at least somewhat higher-level so that mistakes are easier in forming it; consider its valid definition; then modify its definition by dropping one of its defining characteristics and substituting a non-essential in it.)

Homework 2: Supply a few examples of going wrong in the second way, i.e., in forming propositions. (Hint: I guess almost any logical fallacy can be taken as a starting point for generating examples here.)


Truth-hood operator for statements:

As seen above, statements (i.e. complete sentences that formally can be treated as propositions) made at the conceptual level can, and do, go wrong.

We therefore define a truth-hood operator which, when it operates on a statement, yields the result as to whether the given statement is true or non-true. (Aside: Without getting into further epistemological complexities, let me note here that I reject the idea of the arbitrary, and thus regard non-true as nothing but a sub-category of the false. Thus, in my view, a proposition is either true or it is false. There is no middle (as Aristotle said), or even an “outside” (like the arbitrary) to its truth-status.)

Here are a few examples of applying the truth-status (or truth-hood) operator to a statement:

  • Truth-hood[ California is not a state in the USA ] = false
  • Truth-hood[ Texas is a state in the USA ] = true
  • Truth-hood[ All reasonable people are leftists ] = false
  • Truth-hood[ All reasonable people are rightists ] = false
  • Truth-hood[ Indians have significantly contributed to mankind’s culture ] = true
  • etc.

For ease in writing and manipulation, we propose to give names to statements. Thus, first declaring

A: California is not a state in the USA

and then applying the Truth-hood operator to “A”, is fully equivalent to applying this operator to the entire sentence appearing after the colon (:) symbol. Thus,

Truth-hood[ A ] <==> Truth-hood[ California is not a state in the USA ] = false
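
Just to make the notation concrete, here is a minimal, purely illustrative sketch in Python (mine, not part of the original post): the named statements go into a table, and a hypothetical truth_hood() function looks up the truth-status which we ourselves assign to each named statement.

# A minimal illustration (mine, not from the post): named statements kept
# in a table, and a hypothetical truth_hood() operator that looks up the
# truth-status which we ourselves assign to each named statement.

STATEMENTS = {
    "A": "California is not a state in the USA",
    "B": "Texas is a state in the USA",
}

# The truth-statuses below are assumptions supplied just for the example.
TRUTH_STATUS = {
    "A": False,
    "B": True,
}

def truth_hood( sName ):
    # Returns the truth-status of the named statement.
    return TRUTH_STATUS[ sName ]

print( truth_hood( "A" ) )   # False, i.e., Truth-hood[ A ] = false
print( truth_hood( "B" ) )   # True,  i.e., Truth-hood[ B ] = true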


Just a bit of the computer languages theory: terminals and non-terminals:

To take a short-cut through this entire theory, we would like to approach the idea of statements from a little abstract perspective. Accordingly, borrowing some terminology from the area of computer languages, we define and use two types of symbols: terminals and non-terminals. The overall idea is this. We regard any program (i.e. a “write-up”) written in any computer-language as consisting of a sequence of statements. A statement, in turn, consists of certain well-defined arrangement of words or symbols. Now, we observe that symbols (or words) can be  either terminals or non-terminals.

You can think of a non-terminal symbol in different ways: as higher-level or more abstract words, as “potent” symbols. The non-terminal symbols have a “definition”—i.e., an expansion rule. (In CS, it is customary to call an expansion rule a “production” rule.) Here is a simple example of a non-terminal and its expansion:

  • P => S1 S2

where the symbol “=>” is taken to mean things like: “is the same as” or “is fully equivalent to” or “expands to.” What we have here is an example of an abstract statement. We interpret this statement as the following. Wherever you see the symbol “P,” you may substitute it using the train of the two symbols, S1 and S2, written in that order (and without anything else coming in between them).

Now consider the following non-terminals, and their expansion rules:

  • P1 => P2 P S1
  • P2 => S3

The question is: Given the expansion rules for P, P1, and P2, what exactly does P1 mean? what precisely does it stand for?

Answer:

  • P1 => (P2) P S1 => S3 (P) S1 => S3 S1 S2 S1

In the above, we first take the expansion rule for P1. Then, we expand the P2 symbol in it. Finally, we expand the P symbol. When no non-terminal symbol is left to expand, we arrive at our answer that “P1” means the same as “S3 S1 S2 S1.” We could have said the same fact using the colon symbol, because the colon (:) and the “expands to” symbol “=>” mean one and the same thing. Thus, we can say:

  • P1: S3 S1 S2 S1

The left hand-side and the right hand-side are fully equivalent ways of saying the same thing. If you want, you may regard the expression on the right hand-side as a “meaning” of the symbol on the left hand-side.

It is at this point that we are able to understand the terms: terminals and non-terminals.

The symbols which do not have any further expansion for them are called, for obvious reasons, the terminal symbols. In contrast, non-terminal symbols are those which can be expanded in terms of an ordered sequence of non-terminals and/or terminals.
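
To make the expansion procedure tangible, here is a small sketch in Python (my own, not the author’s): the non-terminals are the keys of a rule table, anything without an entry counts as a terminal, and a recursive Expand() function carries out exactly the substitutions done by hand above.

# A small sketch (mine): rule table for the P, P1, P2 example above.
RULES = {
    "P":  [ "S1", "S2" ],
    "P1": [ "P2", "P", "S1" ],
    "P2": [ "S3" ],
}

def Expand( sSymbol ):
    # A terminal has no entry in the rule table; return it as it is.
    if sSymbol not in RULES:
        return [ sSymbol ]
    # A non-terminal: expand each symbol of its rule, in order.
    lResult = []
    for s in RULES[ sSymbol ]:
        lResult.extend( Expand( s ) )
    return lResult

print( Expand( "P1" ) )   # ['S3', 'S1', 'S2', 'S1']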

We can now connect our present discussion (which is in terms of computer languages) to our prior discussion of statements (which is in terms of symbolic logic), and arrive at the following correspondence:

The name of every named statement is a non-terminal; and the statement body itself is an expansion rule.

This correspondence works also in the reverse direction.

You can always think of a non-terminal (from a computer language) as the name of a named proposition or statement, and you can think of an expansion rule as the body of the statement.

Easy enough, right? … I think that we are now all set to consider the next topic, which is: liar’s paradox.


Liar’s paradox:

The liar paradox is a topic from the theory of logic [^]. It has been resolved by many people in different ways. We would like to treat it from the viewpoint of the elementary computer languages theory (as covered above).

The simplest example of the liar paradox is, using the terminology of the computer languages theory, the following named statement or expansion rule:

  • A: A is false.

Notice, it wouldn’t be a paradox if the same non-terminal symbol, viz. “A” were not to appear on both sides of the expansion rule.

To understand why the above expansion rule (or “definition”) involves a paradox, let’s get into the game.

Our task will be to evaluate the truth-status of the named statement that is “A”. This is the “A” which comes on the left hand-side, i.e., before the colon.

In symbolic logic, a statement is nothing but its expansion; the two are exactly and fully identical, i.e., they are one and the same. Accordingly, to evaluate the truth-status of “A” (the one which comes before the colon), we consider its expansion (which comes after the colon), and get the following:

  • Truth-hood[ A ] = Truth-hood[ A is false ] = false           (equation 1)

Alright. From this point onward, I will drop explicitly writing down the Truth-hood operator. It is still there; it’s just that to simplify typing out the ensuing discussion, I am not going to note it explicitly every time.

Anyway, coming back to the game, what we have got thus far is the truth-hood status of the given statement in this form:

  • A: “A is false”

Now, realizing that the “A” appearing on the right hand-side itself also is a non-terminal, we can substitute for its expansion within the aforementioned expansion. We thus get to the following:

  • A: “(A is false) is false”

We can apply the Truth-hood operator to this expansion, and thereby get the following: The statement which appears within the parentheses, viz., the “A is false” part, itself is false. Accordingly, the Truth-hood operator must now evaluate thus:

  • Truth-hood[ A ] = Truth-hood[ A is false] = Truth-hood[ (A is false) is false ] = Truth-hood[ A is true ] = true            (equation 2)

Fun, isn’t it? Initially, via equation 1, we got the result that A is false. Now, via equation 2, we get the result that A is true. That is the paradox.

But the fun doesn’t stop there. It can continue. In fact, it can continue indefinitely. Let’s see how.

If only we were not to halt the expansions, i.e., if only we continue a bit further with the game, we could have just as well made one more expansion, and got to the following:

  • A: ((A is false) is false) is false.

The Truth-hood status of the immediately preceding expansion now is: false. Convince yourself that it is so. Hint: Always expand the inner-most parentheses first.

Homework 3: Convince yourself that what we get here is an indefinitely long alternating sequence of the Truth-hood statuses: A is false, A is true, A is false, A is true, and so on. (A small sketch that mechanizes this alternation follows below.)
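
Here is one way (my own, purely illustrative) to model Homework 3 in Python: each further expansion wraps one more “… is false” around the previous expression, and evaluating from the inside out simply negates the previous truth-status once more. The seed value used for the innermost occurrence is an arbitrary modeling choice; the point is only the alternation itself.

# A minimal sketch (mine) of the alternation in Homework 3.
def TruthHoodAfterExpansions( nExpansions ):
    # Arbitrarily suppose the innermost "A" to be true; each "... is false"
    # wrapper then negates the status once more.
    bStatus = True
    for i in range( nExpansions ):
        bStatus = not bStatus
    return bStatus

for n in range( 1, 7 ):
    print( n, TruthHoodAfterExpansions( n ) )
# Output alternates indefinitely: False, True, False, True, False, True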

What can we say by way of a conclusion?

Conclusion: The truth-status of “A” is not uniquely decidable.

The emphasis is on the word “uniquely.”

We have used all the seemingly simple rules of logic, and yet have stumbled on to the result that, apparently, logic does not allow us to decide something uniquely or meaningfully.


Liar’s paradox and the set theory:

The importance of the liar paradox to our present concerns is this:

Godel himself believed, correctly, that the liar paradox was a semantic analogue to his Incompleteness Theorem [^].

Go read the Wiki article (or anything else on the topic) to understand why. For our purposes here, I will simply point out what the connection of the liar paradox is to the set theory, and then (more or less) call it a day. The key observation I want to make is the following:

You can think of every named statement as an instance of an ordered set.

What the above key observation does is to tie the symbolic logic of proposition with the set theory. We thus have three equivalent ways of describing the same idea: symbolic logic (name of a statement and its body), computer languages theory (non-terminals and their expansions to terminals), and set theory (the label of an ordered set and its enumeration).

As an aside, the set in question may have further properties, or further mathematical or logical structures and attributes embedded in itself. But at its minimal, we can say that the name of a named statement can be seen as a non-terminal, and the “body” of the statement (or the expansion rule) can be seen as an ordered set of some symbols—an arbitrarily specified sequence of some (zero or more) terminals and (zero or more) non-terminals.

Two clarifications:

  • Yes, in case there is no sequence in a production at all, it can be called the empty set.
  • When you have the same non-terminal on both sides of an expansion rule, it is said to form a recursion relation. (A small sketch that checks for this appears just below.)
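
Here is a minimal sketch (mine, not from the post) that makes the second clarification concrete: a rule is recursive when the non-terminal being defined can be reached again by expanding its own rule, which is exactly the situation with the liar-paradox rule “A: A is false.”

# A minimal sketch (mine): detecting a recursion relation in a rule table.
RULES = {
    "A": [ "A", "is", "false" ],   # the liar's paradox, as an expansion rule
    "P": [ "S1", "S2" ],           # a non-recursive rule, for contrast
}

def IsRecursive( sName, dRules ):
    # True if sName can be reached again by expanding its own rule.
    lToVisit = list( dRules.get( sName, [] ) )
    setSeen = set()
    while lToVisit:
        s = lToVisit.pop()
        if s == sName:
            return True
        if s in dRules and s not in setSeen:
            setSeen.add( s )
            lToVisit.extend( dRules[ s ] )
    return False

print( IsRecursive( "A", RULES ) )   # True
print( IsRecursive( "P", RULES ) )   # False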

An aside: It might be fun to convince yourself that the liar paradox cannot be posed or discussed in terms of a Venn diagram. By the simple intuitive notions we all bring to bear on Venn diagrams, the “sheet” on which a Venn diagram is drawn cannot carry a “recursion” relation.

Yes, the set theory itself was always “powerful” enough to allow for recursions. People like Godel merely made this feature explicit, and took full “advantage” of it.


Recursion, the continuum, and epistemological (and metaphysical) validity:

In our discussion above, I had merely asserted, without giving even a hint of a proof, that the three ways (viz., the symbolic logic of statements or  propositions, the computer languages theory, and the set theory) were all equivalent ways of expressing the same basic idea (i.e. the one which we are concerned about, here).

I will now once again make a few more observations, but without explaining them in detail or supplying even an indication of their proofs. The factoids I must point out are the following:

  • You can start with the natural numbers, and by using simple operations such as addition and its inverse, and multiplication and its inverse, you can reach the real number system. The generalization goes as: Natural to Whole to Integers to Rationals to Reals. Another name for the real number system is: the continuum.
  • You can use the computer languages theory to generate a machine representation for the natural numbers. You can also mechanize the addition etc. operations. Thus, you can “in principle” (i.e. with infinite time and infinite memory) represent the continuum in the CS terms.
  • Generating a machine representation for the natural numbers requires the use of recursion. (A minimal sketch of this idea follows right after this list.)
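
By way of illustration only (this sketch is mine, and the names in it are made up), here is a Peano-style representation: a natural number n is just the n-fold application of a successor to zero, and addition is then defined recursively on top of it.

# A minimal sketch (mine): natural numbers and addition, defined recursively.
ZERO = None                        # an arbitrary marker standing for zero

def Successor( n ):
    # Wrap n once more; each wrapping means "plus one".
    return ( "S", n )

def ToInt( n ):
    # Count the wrappings -- recursion again, this time unwinding.
    return 0 if n is ZERO else 1 + ToInt( n[ 1 ] )

def Add( m, n ):
    # m + 0 = m;  m + S(k) = S( m + k )
    return m if n is ZERO else Successor( Add( m, n[ 1 ] ) )

three = Successor( Successor( Successor( ZERO ) ) )
two = Successor( Successor( ZERO ) )
print( ToInt( Add( three, two ) ) )   # 5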

Finally, a few words about epistemological (and metaphysical) validity.

  • The concepts of numbers (whether natural or real) have a logical precedence, i.e., they come first. All of arithmetic and the calculus must come before the computer-representation of some of their concepts does.
  • A machine-representation (or, equivalently, a set-theoretic representation) is merely a representation. That is to say, it captures only some aspects or attributes of the actual concepts from maths (whether arithmetic or the continuum hypothesis). This issue is exactly like what we saw in the first and second posts in this series: a set is a concrete collection, unlike a concept which involves a consciously cast unit perspective.
  • If you try to translate the idea of recursion into the usual cognitive terms, you get absurdities such as: you can be your own child, literally speaking. Not in the sense that, using scientific advances in biology, you can create a clone of yourself and regard that clone to be both yourself and your child. No, not that way. (Actually, such a clone is always your twin, not your child.) The idea here is even worse: it is that you can literally father your own self.
  • Aristotle got it right. Look up the distinction between completed processes and the uncompleted ones. Metaphysically, only those objects or attributes can exist which correspond to completed mathematical processes. (Yes, as an extension, you can throw in the finite limiting values, too, provided they otherwise do mean something.)
  • Recursion by very definition involves not just absence of completion but the essence of the very inability to do so.

Closure on the “learnability issue”:

Homework 4: Go through the last two posts in this series as well as this one, and figure out that the only reason that the set theory allows a “recursive” relation is because a set is, by the design of the set theory, a concrete object whose definition does not have to involve an epistemologically valid process—a unit perspective as in a properly formed concept—and so, its name does not have to stand for an abstract mentally held unit. Call this happenstance “The Glaring Epistemological Flaw of the Set Theory” (or TGEFST for short).

Homework 5: Convince yourself that any lemma or theorem that makes use of Godel’s Incompleteness Theorem is necessarily based on TGEFST, and for the same reason, its truth-status is: it is not true. (In other words, any lemma or theorem based on Godel’s theorem is an invalid or untenable idea, i.e., essentially, a falsehood.)

Homework 6: Realize that the learnability issue, as discussed in Prof. Lev Reyzin’s news article (discussed in the first part of this series [^]), must be one that makes use of Godel’s Incompleteness Theorem. Then convince yourself that for precisely the same reason, it too must be untenable.

[Yes, Betteridge’s law [^] holds.]


Other remarks:

Remark 1:

As “asymptotical” pointed out at the relevant Reddit thread [^], the authors themselves say, in another paper posted at arXiv [^] that

While this case may not arise in practical ML applications, it does serve to show that the fundamental definitions of PAC learnability (in this case, their generalization to the EMX setting) is vulnerable in the sense of not being robust to changing the underlying set theoretical model.

What I now remark here is stronger. I am saying that it can be shown, on rigorously theoretical (epistemological) grounds, that the “learnability as undecidable” thesis by itself is, logically speaking, entirely and in principle untenable.

Remark 2:

Another point. My preceding conclusion does not mean that the work reported in the paper itself is, in all its aspects, completely worthless. For instance, it might perhaps come in handy while characterizing some tricky issues related to learnability. I certainly do admit of this possibility. (To give a vague analogy, this issue is something like running, in a mathematically somewhat novel way, into a known type of mathematical singularity.) Of course, I am not competent enough to judge how valuable the work of the paper(s) might turn out to be, in narrow technical contexts like that.

However, what I can, and will say is this: the result does not—and cannot—bring the very learnability of ANNs itself into doubt.


Phew! First, Panpsychiasm, and immediately then, Learnability and Godel. … I’ve had to deal with two untenable claims back to back here on this blog!

… My head aches….

… Code! I have to write some code! Or write some neat notes on ML in LaTeX. Only then will, I guess, my head stop aching so much…

Honestly, I just downloaded TensorFlow yesterday, and configured an environment for it in Anaconda. I am excited, and look forward to trying out some tutorials on it…

BTW, I also honestly hope that I don’t run into anything untenable, at least for a few weeks or so…

…BTW, I also feel like taking a break… May be I should go visit IIT Bombay or some place in konkan. … But there are money constraints… Anyway, bye, really, for now…


A song I like:

(Marathi) “hirvyaa hirvyaa rangaachi jhaaDee ghanadaaTa”
Music: Sooraj (the pen-name of “Shankar” from the Shankar-Jaikishan pair)
Lyrics: Ramesh Anavakar
Singers: Jaywant Kulkarni, Sharada


[Any editing would be minimal; guess I will not even note it down separately.] Did an extensive revision by 2019.01.21 23:13 IST. Now I will leave this post in the shape in which it is. Bye for now.


Learnability of machine learning is provably an undecidable?—part 1

Update on 23 January 2019, 17:55 IST:

In this series of posts, which was just a step further from the initial, brain-storming kind of a stage, I had come to the conclusion that based on certain epistemological (and metaphysical) considerations, Ben-David et al.’s conclusion (that learnability can be an undecidable) is logically untenable.

However, now, as explained here [^], I find that this particular conclusion which I drew, was erroneous. I now stand corrected, i.e., I now consider Ben-David et al.’s result to be plausible. Obviously, it merits a further, deeper, study.

However, even while acknowledging the above-mentioned mistake, let me also hasten to clarify that I still stick to my other positions, especially the central theme of this series of posts. The central theme here was that there are certain core features of the set theory which make implications such as Godel’s incompleteness theorems possible. These features (of the set theory) demonstrably carry a glaring epistemological flaw, such that applying Godel’s theorem outside of its narrow technical scope in mathematics or computer science is not permissible. In particular, Godel’s incompleteness theorem does not apply to knowledge or its validation in the more general sense of these terms. This theme, I believe, continues to hold as is.

Update over.


This one news story has been lying around for about a week on my Desktop:

Lev Reyzin, “Unprovability comes to machine learning,” Nature, vol. 565, pp. 166–167, 10 January 2019 [^]. PDF here: [^]

(I’ve forgotten how I came to know about it though.) The story talks about the following recent research paper:

Ben-David et al., “Learnability can be undecidable,” Nature Machine Intelligence, vol. 1, pp. 44–48, January 2019 [^]. PDF here: [^]

I don’t have the requisite background in the theory of the research paper itself, and so didn’t even try to read through it. However, I did give Reyzin’s news article a try. It was not very successful; I have not been able to finish the story yet. However, here are a few notings which I made as I tried to progress through it. The quotations here all come from Reyzin’s news story.

Before we begin, take a moment to notice that the publisher here is arguably the most reputed one in science, viz., the Nature publishing group. As to the undecidability of learnability, its apparent implications for practical machine learning, artificial intelligence, etc., are too obvious to be pointed out separately.


“During the twentieth century, discoveries in mathematical logic revolutionized our understanding of the very foundations of mathematics. In 1931, the logician Kurt Godel showed that, in any system of axioms that is expressive enough to model arithmetic, some true statements will be unprovable.”

Is it because Godel [^] assumed that any system of axioms (which is expressive enough to model arithmetic) would be based on the standard (i.e. mathematical) set theory? If so, his conclusion would not be all that paradoxical, because the standard set theory carries, from an epistemological angle, certain ill-conceived notions at its core. [BTW, throughout this (short) series of posts, I use Ayn Rand’s epistemological theory; see ITOE, 2e [^][^].]


To understand my position (that the set theory is not epistemologically sound), start with a simple concept like “table”.

According to Ayn Rand’s ITOE, the concept “table” subsumes all possible concrete instances of tables, i.e., all the tables that conceivably exist, might have ever existed, and might ever exist in future, i.e., a potentially infinite number of concrete instances of them. Ditto for any other concept, e.g., “chair.” Concepts are mental abstractions that stand for a potentially infinite number of concretes of a given kind.

Now, let’s try to run away from philosophy, and thereby come to rest in the arms of, say, a mathematical logician like Kurt Godel [^], or preferably, his predecessors, those who designed the mathematical set theory [^].

The one (utterly obvious) way to capture the fact that there exist tables, but only using the actual terms of the set theory, is to say that there is a set called “tables,” and that its elements consist of all possible tables (i.e., all the tables that might have existed, might conceivably exist, and would ever conceivably exist in future). Thus, the notion again refers to an infinity of concretes. Put into the terms of the set theory, the set of tables is an infinite set.

OK, that seems to work. How about chairs? Once again, you set up a set, now called “chairs,” and proceed to dump within its braces every possible or conceivable chair.

So far, so good. No trouble until now.


The trouble begins when you start applying operators to the sets, say by combining them via unions, or by taking their intersections, and so on—all that Venn’s diagram business [^]. But what is the trouble with the good old Venn diagrams, you ask? Well, the trouble is not so much to the Venn diagrams as it is to the basic set theory itself:

The set theory makes the notion of the set so broad that it allows you to combine any sets in any which way you like, and still be able to call the result a meaningful set—meaningful, as seen strictly from within the set theory.

Here is an example. You can not only combine (take the union of) “tables” and “chairs” into a broader set called “furniture,” you are also equally well allowed, by the formalism of the set theory, to absorb into the same set all unemployed but competent programmers, Indian HR managers, and venture capitalists from the San Francisco Bay Area. The set theory does not by itself have anything in its theoretical structure, formalism or even mathematical application repertoire, using which it could possibly so much as raise a finger in such matters. This is a fact. If in doubt, refer to the actual set theory ([^] and links therein), take it strictly on its own terms, and in particular, desist from mixing into it any extra interpretations brought in by you.

Epistemology, on the other hand, does have theoretical considerations, including explicitly formulated rules at its core, which together allow us to distinguish between proper and improper formulations of concepts. For example, there is a rule that the concrete instances being subsumed under a concept must themselves be conceptually commensurate, i.e., they must possess the same defining characteristics, even if possibly to differing degrees. Epistemology prevents the venture capitalists from the San Francisco Bay Area from being called pieces of furniture because it clarifies that they are people, whereas pieces of furniture are inanimate objects, and for this crucial reason, the two are conceptually incommensurate—they cannot be integrated together into a common concept.

To come back to the set theory: it, however, easily allows for every abstractly conceivable “combination” of any possible operand set(s). Whether the operation has any cognitive merit to it or not, whether it results in anything meaningful at all or not, is not at all a consideration, at least not by the design of the set theory itself (which, many people suppose, is more fundamental than every other theory).

So—and get this right—calling the collection of QC scientists as either politicians or scoundrels is not at all an abuse of the mathematical structure, content, and meaning of the set theory. The ability to take an intersection of the set of all mathematicians who publish papers and the set of all morons is not a bug, it is very much a basic and core feature of the set theory. There is absolutely nothing in the theory itself which says that the intersection operator cannot be applied here, or that the resulting set has to be an empty set. None.

The set theory very much neglects considerations of the kind of label a set carries, and of the kind of elements that can be added to it.
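
A tiny illustration (mine, not the author’s) of this point in Python: the set machinery itself raises no objection whatever when semantically unrelated elements are lumped together, or when an “absurd” intersection is taken.

# A minimal illustration (mine): set operations carry no semantic constraints.
furniture = { "table-1", "chair-1" }
oddities = { "a VC from the Bay Area", "an HR manager" }

# The union is a perfectly well-formed set, as far as the set theory goes.
print( furniture | oddities )

# Likewise, the intersection is always a permitted operation; the theory
# itself never rules it out (here it merely happens to come out empty).
print( furniture & oddities )   # set()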

More on this, later. (This post has already gone past 1000 words.)


The songs section will come at the conclusion of this (short) series of posts, to be completed soon enough; stay tuned…

Data science code snippets—2: Using world’s simplest ANN to visualize that deep learning is hard

Good morning! [LOL!]


In the context of machine learning, “learning” only means: changes that get effected to the weights and biases of a network. Nothing more, nothing less.

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. Actually, the development of the code went through two distinct stages.


First stage: A two-layer network (without any hidden layer); only one neuron per layer:

In the first stage, I wrote what demonstrably is world’s simplest possible artificial neural network. This network comes with only two layers: the input layer, and the output layer. Thus, there is no hidden layer at all.

The advantage with such a network (“linear” or “1D” and without hidden layer) is that the total error at the output layer is a function of just one weight and one bias. Thus the total error (or the cost function) is a function of only two independent variables. Therefore, the cost function-surface can (at all) be directly visualized or plotted. For ease of coding, I plotted this function as a contour plot, but a 3D plot also should be possible.
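
As an aside, here is a minimal sketch (not the author’s script, which appears further below under “The code:”) of how such a two-variable cost surface can be tabulated and plotted; the input value assumed here is arbitrary.

# A minimal sketch (mine): the cost surface of the 2-layer, one-neuron-per-
# layer network, as a function of just the weight w and the bias b.
import numpy as np
import matplotlib.pyplot as plt

def Sigmoid( z ):
    return 1.0 / ( 1.0 + np.exp( -z ) )

dInput, dTarget = 0.5, 0.15                 # an assumed input, and the target
w = np.linspace( -5.0, 5.0, 101 )
b = np.linspace( -5.0, 5.0, 101 )
W, B = np.meshgrid( w, b )
C = 0.5 * ( dTarget - Sigmoid( W * dInput + B ) )**2   # cost at each (w, b)

plt.contourf( B, W, C, levels=20 )
plt.colorbar( label="cost" )
plt.xlabel( "bias b" )
plt.ylabel( "weight w" )
plt.title( "Cost surface for the simplest 2-layer network" )
plt.show()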

Here are a couple of pictures the code (given further below) produces. The input randomly varies in the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept at 0.15.

The following plot shows an intermediate stage during training, a stage when the gradient descent algorithm is still very much in progress.

Simplest network with just input and output layers, with only one neuron per layer: Gradient descent in progress.

The following plot shows when the gradient descent has taken the configuration very close to the local (and global) minimum:

Simplest network with just input and output layers, with only one neuron per layer: Gradient descent near the local (and global) minimum.


Second stage: An n-layer network (with many hidden layers); only one neuron per layer

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.
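
For a rough, back-of-the-envelope feel of why the learning concentrates near the output layer, here is a small sketch (mine, with made-up but representative values): in this one-neuron-per-layer chain, back-propagation multiplies the delta by an extra factor of w * sigmoid'(z) for each layer you step back from the output, and with the weight and bias both at 2.0 that factor is far below 1.

# A rough sketch (mine): how the gradient shrinks per layer in a
# one-neuron-per-layer chain with sigmoid activations.
import numpy as np

def Sigmoid( z ):
    return 1.0 / ( 1.0 + np.exp( -z ) )

def SigmoidDerivative( z ):
    a = Sigmoid( z )
    return a * ( 1.0 - a )

# Assumed values, roughly matching the run described below: weight and bias
# of 2.0 everywhere, and (for simplicity) the same activation feeding each neuron.
dW, dB, dAIn = 2.0, 2.0, 0.5
dZ = dW * dAIn + dB
dFactor = dW * SigmoidDerivative( dZ )   # extra multiplicative factor per layer

dDelta = 1.0                             # a nominal gradient at the output layer
for nStepsBack in range( 4 ):
    print( "layers behind the output:", nStepsBack, " relative gradient:", dDelta )
    dDelta = dDelta * dFactor            # each step back multiplies by ~0.09 here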

Here is a representative picture the code produces when the total number of layers is 5 (i.e. when there are 3 hidden layers). It was made with both the biases and the weights all being set to the value of 2.0:

It is clearly seen that almost all of learning is limited only to the output layer; the hidden layers have learnt almost nothing!

Now, by way of a contrast, here is what happens when you have all initial biases of 0.0, and all initial weights of 2.0:

Here, the last hidden layer has begun learning enough that it shows some visible change during training, even though the output layer learns much more than does the last hidden layer. Almost all learning is via changes to weights, not biases.

Next, here is the reverse situation: when you have all initial biases of 2.0, but all initial weights of 0.0:

The bias of the hidden layer does undergo a slight change, but in an opposite (positive) direction. Compared to the case just above, the learning now is relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both biases and weights to 0.0.

The network does learn (the difference in the predicted vs. target does go on diminishing as the training progresses). However, the percentage change is too small to be visually registered (when plotted to the same scale as what was used earlier).


The code:

Here is the code which produced all the above plots (but you have to suitably change the hard-coded parameters to get to each of the above cases):

'''
SimplestANNInTheWorld.py
Written by and Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only one neuron. Naturally, there is only one target!
-- It may not even have hidden layers. 
-- However, it can have an arbitrary number of hidden layers. This 
feature makes it a good test-bed to see why and how the neurons 
in the hidden layers don't learn much during deep learning, during 
a ``straight-forward'' application of the gradient descent algorithm. 
-- Please do drop me a comment or an email 
if you find this code useful in any way, 
say, in a corporate training setup or 
in academia. Thanks in advance!
-- History:
* 30 December 2018 09:27:57  IST: 
Project begun
* 30 December 2018 11:57:44  IST: 
First version that works.
* 01 January 2019 12:11:11  IST: 
Added visualizations for activations and gradient descent for the 
last layer, for no. of layers = 2 (i.e., no hidden layers). 
* 01 January 2019 18:54:36  IST:
Added visualizations for percent changes in biases and 
weights, for no. of layers >= 3 (i.e. at least one hidden layer).
* 02 January 2019 08:40:17  IST: 
The version as initially posted on my blog.
'''
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution, 
    # but rand() returns samples from the uniform distribution: [0,1). 
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):
    adAllBs = np.random.randn( nLayers-1 )
    adAllWs = np.random.randn( nLayers-1 )
    return adAllBs, adAllWs

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):
    adAllBs = np.ndarray( nLayers-1 )
    adAllBs.fill( dB )
    adAllWs = np.ndarray( nLayers-1 )
    adAllWs.fill( dW )
    return adAllBs, adAllWs

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with the activation at the output layer. 
# In this script, the target value is always one and the 
# same (the global dTarget, set further below). 
# Assumes that the form of the cost function is: 
#       C_x = 0.5 * ( dT - dA )^2 
# where, note carefully, that target comes first. 
# Hence the partial derivative is:
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# where note carefully that the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget ) 

# Trivial for the scalar (one-neuron-per-layer) case here; kept only for
# structural similarity with the general, matrix-based back-propagation code.
def Transpose( dA ):
    return np.transpose( dA )

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member
    # as compared to l_dAllZs, with the first member being the
    # supplied activation of the input layer 
    l_dAllAs = [ dA ]
    nL = 1
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers
        # to the activation of the current layer (nL) 
        dA = Sigmoid( dZ )  
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash
    dGradB = dDelta
    adAllGradBs[ -1 ] = dGradB

    # Since the last hidden layer has only one neuron, no need to take transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )
    adAllGradWs[ -1 ] = dGradW
    ## print( "\t* Layer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-1, dGradB, dGradW) )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNext = adAllWs[ -nL+1 ]
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta ) 
        dDelta = dDot * dZCurDash
        dGradB = dDelta
        adAllGradBs[ -nL ] = dGradB

        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )
        adAllGradWs[ -nL ] = dGradWCur

        ## print( "\tLayer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-nL, dGradB, dGradW) )

    return ( adAllGradBs, adAllGradWs )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget) 
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )
    
    nLayers = len( l_dAllAs )
    dES = 0.2	# Extra space, in inches
    plt.axis( [-dES, float(nLayers) -1.0 + dES, -dES, 1.0+dES] )
    
    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )
    
    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )
    
def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()
    
    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] ) 
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
    "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB) 
    plt.title( sDesc )
    
    
def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b ) 
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys, # cmap=plt.cm.RdYlBu_r, 
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

def PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget ):
    plt.clf()
    
    nComputeLayers = len( adOrigBs )
    plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )
    
    adBPct = GetPercentDiff( nComputeLayers, adAllBs, adOrigBs )
    adWPct = GetPercentDiff( nComputeLayers, adAllWs, adOrigWs )
    print( "Case: %3d" \
    "\nPercent Changes in Biases:\n%s" \
    "\nPercent Changes in Weights:\n%s\n" \
     % (c, adBPct, adWPct)  )
    adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
    plt.plot( adx + 1.0, adWPct, 'ro' )
    plt.plot( adx + 1.15, adBPct, 'bo' )
    plt.grid()
    plt.xlabel( "Layer Number" )
    plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
    sTitle = "How most learning occurs only at an extreme layer\n" \
    "Percent Changes to Biases and Weights at Each Layer.\n" \
    "Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted) 
    plt.title( sTitle )

def GetPercentDiff( n, adNow, adOrig ):
    adDiff = adNow - adOrig
    print( adDiff )
    adPct = np.zeros( n )
    dSmall = 1.0e-10
    if np.all( np.abs( adDiff ) > dSmall ) and np.all( np.abs( adOrig ) > dSmall ):
        adPct = adDiff / adOrig * 100.0
    return adPct


################################################################################
# The Main Script
################################################################################

dEta = 1.0 # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases ) #, 0.0 )
adTest = GenerateDataRandom( nTestCases )

np.random.shuffle( adInput )
## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0), 
# i.e., excluding both the end-points of 0.0 and 1.0.

dTarget = 0.15

# The input layer has no biases or weights. Even the output layer 
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a 
# vector.
# For visualization with a 2 layer-network, keep biases and weights 
# between [-4.0, 4.0]

# adAllBs, adAllWs = GenerateBiasesWeightsRandom( nLayers )
adAllBs, adAllWs = GenerateBiasesWeightsConstant( nLayers, 2.0, 2.0 )
dOrigB = adAllBs[-1]
dOrigW = adAllWs[-1]
adOrigBs = adAllBs.copy()
adOrigWs = adAllWs.copy()

## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )
    
# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )
    
    adAllGradBs = [ np.zeros( b.shape ) for b in adAllBs ]
    adAllGradWs = [ np.zeros( w.shape ) for w in adAllWs ]
    
    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    
    # Do the back-propagation
    adAllGradBs, adAllGradWs = BackPropagation( l_dAllZs, l_dAllAs )

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB 
                for dB, dDeltaB in zip( adAllBs, adAllGradBs ) ]
    adAllWs = [ dW - dEta * dDeltaW 
                for dW, dDeltaW in zip( adAllWs, adAllGradWs ) ]
    
    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        dW = adAllWs[ -1 ]
        dB = adAllBs[ -1 ]
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights, Biases for all layers
        # and Activations for all layers 
        dPredicted = l_dAllAs[ -1 ]
        PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget )
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):
    
    dInput = adTest[ c ]
    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )
    
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n" % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )

 


Things you can try:

  • Change one or more of the following parameters, and see what happens:
    • Target value
    • Values of initial weights and biases
    • Number of layers
    • The learning rate, dEta
  • Change the activation function; e.g., try a linear function instead of the Sigmoid. Change the code accordingly.
  • Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…

Have fun!


A song I like:

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”
Music and Singer: C. Ramchandra
Lyrics: Suresh Bhat

 

My new year’s resolutions—2019 edition

Here are my resolutions for the new year:


1. Get a suitable job in Data Science in Pune.

Revise the resume and upload / send out by January-end.


2. Wrap up my research on the non-relativistic QM:

Get a Beamer presentation (containing all the main points of the paper) out by Date-1.

Get the first version of the QM paper out, by Date-2

Submit the paper to a suitable journal which accepts papers on the foundations, by Date-3.

The optimistic-realistic-pessimistic estimates for the dates are:
Date-1: 28 Feb — 31 March — 31 May
Date-2: 31 March — 31 May — 31 July
Date-3: 30 April — 30 June — 31 August

The reason for the somewhat larger variance is that I will also be working in Data Science, and probably also just beginning doing a job in it. Much depends on how circumstances work out in this regard.

It’s very likely that QM will cease to be of much interest to me after that, though, of course, I will keep myself available for discussions concerning this paper. Another exception is noted near the end.


3. Self-metered ‘net access:

No more than just one hour of general surfing per day, preferably 45 minutes. The time spent on blogging, browsing news sites, reading personal blogs, emails, etc. is included.

Time taken by the OS and software updates, and by large downloads like software libraries, large data-sets, etc. is not included. Any time possibly spent on programming in the cloud, on browsing tutorials, help on python libraries etc., also is not included.


4. Daily exercises and meditation:

To quantify: “Daily” here means on at least 300 days a year.

Do some mild exercises at home, daily. (Really mild form of “exercise”s. In the main, just a few stretching exercises, and a few “surya-namaskaar”s. No need to exceed 12–15 “namaskaar”s a day eventually; begin with just 3 to 5.)

Offer a small prayer at home. Just about 10–15 minutes, but try to do it daily. (No particular time-slot in the day.)

Meditate more regularly, say for 15–30 minutes, at least 4 times a week. At least 10 minutes on the remaining days, just to keep a continuity.

Try: Taking a morning walk to a nearby garden at least 3 times a week, preferably daily (rainy days excluded). (Currently doable, but it depends on where I get my next job. If I have to spend some 3–4 hours a day in just commuting (the way I did during 2015–16), then no guilt for dropping this resolution.)

Come to think of it, I have done this all for extended periods (of several years). It was just that since moving to Mumbai (in 2013) onwards, there occurred a break. I am just going to pick up once again these good habits. All in all, should be easy to keep this resolution. Let’s see how it turns out.


5. Eat more salads:

Once in a job, try to have mostly just salads for the lunch (thus ensuring 5 meals a week predominantly of salads). Even otherwise, try to have salads for lunches, for about 15 days out of a month.

I have tried eating salads, and have found that, once again, this resolution too should be pretty easy to follow. Indeed, this is going to be the easiest one for me to keep. The reason is: really good salad services are available in Pune these days—not just veg. salads but also the greens + nonveg type of salads.


6. Begin cultivating a pet-project in Data Science:

Settle on an area and begin working on it this year.

The topic is likely to be: Rain-fall predictions.

A good, more modest initial goal might be to build a model for predicting just the temperatures in between October through May. That’s because predictions for temperatures in this period, I guess, would mostly involve only temperature and wind-speed data, which should be more easily available. (Data and predictions for pressure, humidity, and rainfalls might turn out to be a more difficult story.)


Things noticeably absent from my resolutions:

1. Restrictions on food & drinks. The idea is that the above resolutions themselves should lead to a better lifestyle so that restrictions as such aren’t necessary. And, in case the above resolutions also get broken, then trying to observe restrictions on just food and drinks is going to be pretty artificial, even just a “duty”! To be avoided.

2. Some other “Good habit”s like maintaining records of expenses on a daily basis, writing diary, etc. I just cannot maintain such things on a regular basis, so no point in making any resolutions about them.


Other things on the todo lists (though not resolutions):

1. After getting a job in Data Science, also try to explore a job as an Adjunct/Affiliate Professor in Pune. No more than 6 hours of commitment per week, including any time spent guiding student projects. For about 2 hours / week, even pro-bono positions can be considered, if the college is convenient for commute. Only for the computational topics of: Data Science / FEM / CFD / computational QM.

2. If possible, begin exploring relativistic QM. No time-frame is being specified for these studies. It will be just an exploration. The only reason to include it here is that I believe my new approach is likely to simplify understanding the relativistic QM as well; so I would just like to explore the simplest theoretical topics (at the UG level) concerning the relativistic QM too. (So far, I have summarily ignored it, but from now on, especially in the second half of the year, and especially after my paper on non-relativistic QM is out, I would like to pursue it, just a bit.)

3. Participate in a Kaggle competition, especially in the second half of this year—purely for fun. If possible, do it right in the first half (though because of QM and all, it might not be possible, though if I get someone else suitable to form a team, this option would be still open).


Changes at this blog:

1. For the songs section, from now on, I may repeat some of the songs I have already run here.

It sometimes so happens that a song comes to me very naturally, I like it too, but it’s just that because I noted it on the blog some time ago, I cannot mention it again. In the new year, I am going to break this (self-made) rule.

2. I will also try to reduce the length of blog posts, preferably to within 1000 words per entry.


A song I like:

(Western, instrumental): The title song of the movie “Chariots of Fire.”
Music: Vangelis. [Inspired from the song “City of Violets” by Stavros Logarides? See the note below.]

Note: I guess I had missed this movie (though I had watched its trailers in the movie halls many times back then in the early 1980s). Thus, the version of this song which I first listened to probably was not the original one [^], but some later rendition by someone / some orchestra, very possibly, that by Paul Mauriat. My primary memory of this song refers to this later version. Yesterday, when I checked out Paul Mauriat’s version [^], I felt that this wasn’t it. Some time in between, there also appeared a rendition by Yanni [^], and I liked it too. (I am sure that I had listened to this song before Yanni’s version came on this scene). Finally, just yesterday, I also caught, for the very first time, the London Olympics 2012 version (i.e., “Isles of Wonder” by the London Symphony Orchestra); once again, I found that it was a great rendition [^]. … It’s wonderful to see different emphases being made to the same “tune.”

Today, if I have to make a choice, I would pick up Paul Mauriat’s version [^] as the one I like best.

Incidentally, yesterday, while browsing the Wikipedia for this movie and the song, I also got to know for the first time about the plagiarism controversy involving this song [^], and so, I checked out Stavros Logarides’ song: “City of Violets” [^], especially this version [^]. The similarity is plain unmistakable. Even if Vangelis is a reputed composer, and has won not just the academy award but also the court-case (about the alleged plagiarism), if you ask me, the similarity is sufficient that I have no choice but to note Logarides’ name as well. After all, his song historically came first—whether Vangelis was inspired from it or not!


My approach, my credit:

The song controversy again highlights the reason why care must be taken by any original author, for protecting his IPR. … Another reason why I have been insisting on holding those informal seminars in the physics departments in this country, and why I got upset when all these physicists declined me.

The latest email I wrote (a couple of days ago) has been to Prof. Sunil Mukhi, HoD Physics, IISER Pune [^]; he also maintains this blog [^]. I wrote that email with a cc to Prof. Nilima Gupte [^] of IIT Madras, my alma mater. (Gupte and Mukhi were students at SUNY Stony Brook at the same time, I had gathered years ago, while reading the blog maintained by Gupte’s late husband.) As of this writing, I still await Mukhi’s reply.

The reason now to rush up at least a set of presentation slides (on my new approach to QM) has also to do with the fact that my computer was broken into, over the past few months. Best to hurry up the publication. Thus the resolution # 2 above.


Anyway, enough is enough. Any further editing will be very minor one, and even if I effect it, there won’t be any additions to my NYRs, for sure! For the same reason, I won’t even separately note such minor updates.

Bye for now, take care, and wish you all a happy (and a prosperous) new year!

Data science code snippets—1: My first ANN (a copy of Nielsen’s)

I took the liberty to re-type Nielsen’s[^] code [^], in the process introducing the Hungarian naming convention for variables, and also many other non-Pythonic and non-Nielsenic whimsies. Here is the end-result, for whatever worth it is. (Its sole intended purpose was to run the feed-forward and back-propagation algorithms in IDE.) Tested just a slight bit on Ubuntu Release 18.04.1 LTS (Bionic Beaver) 64-bit + Python 3.7.0 64-bit in conda 4.5.11 + MS VSCode 1.30.1.

It works. (Has to. Except for the cosmetic changes, it’s basically Nielsen’s code.) It produces this graph as an output:

Total Errors vs. Epochs

“Enjoy!”

'''
ANN.Nielsen.SmallerAndEasiestDataVectors.py
-- Simple Artificial Neural Network.
-- This code very closely follows "network.py" given in Nielsen's book
(chapter 1). It was modified a bit by Ajit R. Jadhav. 
This code Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Meant for developing understanding of ANNs through debugging in IDE 
(with ability to inspect values live at run-time that also can be 
verified manually, by taking *smaller* and *simpler* vectors for the input 
and target data-sets).
-- Includes functions for generation of small and simple 1D input vectors 
and their corresponding target vectors. The input data generated here is, 
effectively, just a randomized version of Nielsen's "vectorized_result" 
function.
-- Other features of this code:
* Writes the total error for each epoch, to a file.
* Plots the total error vs. epoch number. 
* Follows the Hungarian naming convention. However, the variable names 
may not reflect the best possible choices; this code was written in a hurry. 
-- Do whatever you like with it, but 100% on your own 
responsibility. 
-- But if you use this code for academic or corporate 
training purposes, then please do let me know by an 
email or a (confidential) comment here. 
Thanks in advance!
-- History
* Begun: 23 Dec. 2018 
* Completed first version that works: 26 Dec. 2018 
* This version: Sunday 30 December 2018 11:54:52  IST 
'''

import random
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Global helper functions
################################################################################

################################################################################ 
# Helper function. Uses numpy.random.rand() to create inputs and targets, 
# whether for training or for testing. Useful for very simple 1D studies only. 
# (The input data as generated here very easily predict the targets! Even just 
# a single call to the softmax function would have been sufficient!!)
def GenerateData( nNeursInInput, nLenOPLayer, nCasesPerTarget ):
    # Python lists for inputs and targets, returned at the end by this function. 
    l_mdInput = []
    l_mdTarget = []
    for t in range( nLenOPLayer ):
        for i in range( nCasesPerTarget ):
            mdInput = np.random.rand( nNeursInInput, 1 )*0.1
            mdInput[t][0] = mdInput[t][0] + np.random.random_sample()*0.1 + 0.8
            l_mdInput.append( mdInput )
            
            mdTarget = np.zeros( (nLenOPLayer,1) ) 
            mdTarget[t][0] = 1.0
            l_mdTarget.append( mdTarget )
    # pickle returns data as arrays; let's simulate the same thing here
    amdInput = np.array( l_mdInput )
    amdOutput = np.array( l_mdTarget )
    # Python 3(+?) thing. Convert zip to list either here or later.
    # Better to do it here. 
    Data = list( zip( amdInput, amdOutput ) )
    return Data
   
################################################################################ 
# Helper function. Computes the sigmoid activation function
def Sigmoid( mdZ ):
    return 1.0 / ( 1.0 + np.exp( - mdZ ) )

################################################################################ 
# Helper function. Computes the derivative of the sigmoid activation function
def SigmoidDerivative( mdZ ):
    mdA = Sigmoid( mdZ )
    mdADer = mdA * ( 1.0 - mdA )
    return mdADer

################################################################################ 
# Helper function. Called with a single activation vector with its 
# target vector. Assumes that the form of the cost function for each 
# neuron is (note carefully that target comes first): 
#       C_x = 0.5 * ( dT - dA )^2 
# so that  
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# (where note that the activation comes first).
def CostDerivative( mdA_OP, mdT ):
    return ( mdA_OP - mdT ) 


################################################################################
# Class Network begins here
################################################################################

class Network(object):

    ############################################################################
    # Constructor. 
    # Supply an array containing number of neurons for each layer
    # e.g., [5,4,3]  
    def __init__(self, anNIL):
        self.m_nLayers = len( anNIL )
        self.m_anNIL = np.array( anNIL )
        # anNIL[1:] means: An array with all entries except the first
        # anNIL[:-1] means: An array with all entries except for the last
        # We allocate (N,1)-shaped matrices rather than (N)-shaped vectors
        # because later on in back-propagation, to have a vectorized code,
        # we perform np.dot() in its tensor-product-like avatar, and there,
        # we need both the operands to be matrices. 
        self.m_amdAllBs = [ np.random.randn( nNeursCur, 1 ) 
                            for nNeursCur in anNIL[1:] ] 
        self.m_amdAllWs = [ np.random.randn( nNeursCur, nNeursPrev ) 
                            for nNeursPrev, nNeursCur in
                                zip( anNIL[:-1], anNIL[1:] ) ] 
        pass

    ############################################################################
    # For each (input, target) tuple representing a row in the 
    # mini-batch, perform the training. We compute the changes (delta's) 
    # to be effected to biases and weights through feedback and 
    # backpropagation, add them up until the mini-batch is over, and then
    # apply them to update the weights and biases only once at the end of 
    # this function. Unlike in Nielsen's code, here we have a separate 
    # function each for feed-forward and back-propagation
    def UpdateMiniBatch( self, dEta, nMiniBatchSize, MiniBatch ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't directly 
        # use np.zeros() because these matrices are not of uniform sizes; 
        # the no. of neurons in each layer is arbitrarily different
        amdAllGradBs = [ np.zeros( mdGradB.shape ) for mdGradB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdGradW.shape ) for mdGradW in self.m_amdAllWs ]
        l_mdErrorsMB = []

        # For each (input, target) tuple representing a row in the 
        # mini-batch, perform the training. 
        
        for x, y in MiniBatch:
            
            # Feed-Forward Pass: Feed the next input vector to the network 
            # using the existing weights and biases, and find the net-input 
            # and activations it produces at all the network layers.
            
            l_mdAllZs, l_mdAllAs = self.FeedForward( x ) 
            
            # Compute the error vector
            mdDiff = y - l_mdAllAs[ -1 ]
            mdErrors = 0.5 * mdDiff * mdDiff
            l_mdErrorsMB.append( mdErrors )
            
            # Back-Propagation Pass: Back-propagate the change in the total cost 
            # function (total error) to the local gradients to be applied to 
            # each weight (of a synaptic connection between current and previous
            # layers), and to each bias (of the current layer). 
            amdDeltaGradB, amdDeltaGradW = self.BackProp( l_mdAllAs, l_mdAllZs, y )
            
            # Add the changes to the local copies of the biases and weights. 
            # The zip function takes iterable elements as input, and returns an 
            # iterator for the tuple. Thus, zipping here makes it iterate through 
            # each layer of the network.
            amdAllGradBs = [ mdGradB + mdDeltaGradB 
                        for mdGradB, mdDeltaGradB in zip( amdAllGradBs, amdDeltaGradB ) ]

            amdAllGradWs = [ mdGradW + mdDeltaGradW 
                        for mdGradW, mdDeltaGradW in zip( amdAllGradWs, amdDeltaGradW ) ]
        # Processing for all the rows in the current mini-batch is now over. 
        # Now, update the network data-members from the local copies.
         
        # Note, we take an average of the changes over all rows.
        dEtaAvgOverMB = dEta / nMiniBatchSize

        # mdB / mdW below: the old biases / weights matrices for a *single* layer
        # mdGradB / mdGradW: the sums, over all rows of the mini-batch, of the
        # per-row gradients accumulated above
        self.m_amdAllBs = [ mdB - dEtaAvgOverMB * mdGradB 
            for mdB, mdGradB in zip( self.m_amdAllBs, amdAllGradBs ) ]

        self.m_amdAllWs = [ mdW - dEtaAvgOverMB * mdGradW 
            for mdW, mdGradW in zip( self.m_amdAllWs, amdAllGradWs ) ]

        # Return the list of error vectors for this mini-batch
        return l_mdErrorsMB

    ############################################################################
    # Called for a single input vector (i.e., "x" in Nielsen's code)
    def FeedForward( self, mdA ):

        # Python list of all activations, layer-by-layer 
        l_mdAllAs = [ mdA ]
        # Python list for z vectors, layer-by-layer
        l_mdAllZs = []

        # For the weight and bias matrices of each layer... 
        for mdW, mdB in zip( self.m_amdAllWs, self.m_amdAllBs ):
            # Compute the net input vector to activate 
            # the neurons of this layer
            mdZ = np.dot( mdW, mdA ) + mdB
            l_mdAllZs.append( mdZ )
            
            # Compute the activations for all the neurons
            # of this layer 
            mdA = Sigmoid( mdZ )
            l_mdAllAs.append( mdA )

        return ( l_mdAllZs, l_mdAllAs )

    ############################################################################
    # Called with inputs and activations for all the layers, 
    # i.e., for the entire network in one go. 
    def BackProp( self, l_mdAllAs, l_mdAllZs, mdTarget ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't use 
        # np.zeros() because these matrices are not of uniform sizes; 
        # the no. of neurons in each layer is arbitrarily different
        amdAllGradBs = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Back-propagation occurs in two distinct stages:
        # (i) In the first stage, we begin from the end (the output layer), and
        # use the layer just before that, in order to update the biases and 
        # weights of the connections lying between the two (which, in this 
        # class, are stored in the OP layer). 
        # (ii) In the second stage, we iterate back successively through the 
        # network, starting from the last hidden layer, up to the first hidden 
        # layer.
        # The split-up into two stages is necessary because the activation of a
        # neuron from the input- or hidden-layers flows through many (all) 
        # neurons in the immediately next layer, and the total cost function 
        # (i.e. the total error) is affected by *all* these paths. In contrast, 
        # the output layer has no next layer, and so, the activation of a 
        # neuron in the output layer affects the total cost function only 
        # through its own connection to the total cost, not that of other 
        # layers. So, the Delta at the OP layer can be directly 
        # computed from a knowledge of the OP layer activations (predictions) 
        # and the prescribed targets. But to compute the Delta at an 
        # intermediate layer, we do need the Delta for the immediately 
        # next layer.  

        ##########
        # Stage I: Compute mdGradB and mdGradW for the Output Layer
        
        # The OP layer is the last, accessed through the -1 index. Get the 
        # activations (A's) on its output side, and the net inputs (Z's) on its 
        # input side.
        mdA_OP = l_mdAllAs[ -1 ] 
        mdZ_OP = l_mdAllZs[ -1 ] 
        
        # Compute the partial derivatives, and use them to find the Grad for 
        # all B's in the OP layer.  
        mdZ_OP_Der = SigmoidDerivative( mdZ_OP )
        # As an intermediate product, find the delta for the OP layer. It is
        # subsequently used in the second stage as an input, thereby 
        # back-propagating the changes to the total cost function back 
        # throughout the network. 
        mdDelta = CostDerivative( mdA_OP, mdTarget ) * mdZ_OP_Der
        # Compute the Grad for all B's in the OP layer
        amdAllGradBs[ -1 ] = mdDelta
        
        # Compute the partial derivatives, and find the Grad for all W's in 
        # the layer just before the OP layer. Index -2 means: just before OP.
        # We use "LHL" to mean the last hidden layer. 
        mdA_LHL_Transpose = l_mdAllAs[ -2 ].transpose()
        # Here, we build the GradW *matrix* from what essentially are two 
        # vectors (or (N,1) matrices). Even if the numpy function says
        # "dot", it actually works like the tensor product---it outputs an 
        # (N X N) matrix, not a single number as a scalar. It is for this step
        # that we made even B's an (N,1) shaped matrix in the constructor. 
        amdAllGradWs[ -1 ] = np.dot( mdDelta, mdA_LHL_Transpose )
        
        ###########
        # Stage II: Compute mdGradB and mdGradW for the last hidden layer
        #           and go back, all the way up to the first hidden layer 

        # Note that only the negative of the index ever gets used in the 
        # following loop. 
        # a[-2] means the second-last layer (i.e. the last hidden layer).
        # a[-1] means the last layer (i.e., the OP layer).
        # a[-2-1] means the layer just before the second-last layer.
        # The current layer starts from the output layer. The previous
        # layer is the one before the current. 
        # If the indexing gets confusing, notice that Nielsen uses the range 
        # (2, nLayers), rather than that of (1, nLayers-1) which would have 
        # been a bit more natural, at least to me. But of course, his choice 
        # is slightly more consistent from a mathematical viewpoint---and 
        # from the fact that there are no weights and biases for the IP 
        # layer.   
        for nL in range( 2, self.m_nLayers ):
            # Find the mdDelta for the previous layer, using that for the
            # current layer...  
            mdZCur = l_mdAllZs[ -nL ]
            mdZCurDer = SigmoidDerivative( mdZCur )
            mdWNextTrans = self.m_amdAllWs[ -nL+1 ].transpose()
            # Note, for the very first pass through the loop, the mdDelta being 
            # used on the RHS is what was earlier computed in the Stage I given 
            # above (viz. for the output layer). It now gets updated here, so 
            # that it can act as the "previous" mdDelta in the next pass 
            # through this loop.
            mdDelta = np.dot( mdWNextTrans, mdDelta ) * mdZCurDer
            # mdDelta now refers to the current layer, not to the next
            amdAllGradBs[ -nL ] = mdDelta
            
            mdAPrevTrans = l_mdAllAs[ -nL-1 ].transpose()
            amdAllGradWs[ -nL ] = np.dot( mdDelta, mdAPrevTrans )

            # Explaining the above four lines is left as an exercise 
            # for the reader. (Hints: (i) Write out the matrix equation 
            # "straight," i.e., as used in the feed-forward pass. (ii) 
            # Use the rules for the matrix multiplications and the rules
            # for taking transposes.)
               
        return ( amdAllGradBs, amdAllGradWs )

    ############################################################################
    # Implements the stochastic gradient descent algorithm
    def SGD( self, sErrorsFileName, nEpochs, 
             TrainingData, nMiniBatchSize, TestData ):
        
        # This array holds, for each epoch, the average of the total activation 
        # errors at the OP layer; it is used only for plotting.
        adErrors = np.zeros( nEpochs ) 

        # To have a small file-size, we persist the total error for the 
        # output layer only as averaged over the entire epoch (i.e., the 
        # entire training data, randomly shuffled). 
        foErrors = open( sErrorsFileName, 'w' )
        
        nTrainingCases = len( TrainingData )
       
        # An epoch is defined for the entire training data. However, each epoch
        # begins by randomly shuffling the training data. Effectively, we
        # change the initial position from which to begin descending down the
        # higher-dimensional total-cost-function surface. 
        for e in range( nEpochs ):

            # Create Mini-Batches covering the entire training data
            np.random.shuffle( TrainingData )
            MiniBatches = [ TrainingData[ k : k + nMiniBatchSize ] 
                            for k in range( 0, nTrainingCases, nMiniBatchSize ) ]
            
            dTotalErrorEpoch = 0
            # For each mini-batch
            for MiniBatch in MiniBatches:
                # Conduct the training over the entire mini-batch, 
                # and collect the errors accumulated over it. Add them
                # to the accumulated error for the current epoch. 
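                # Note: dEta used below is the module-level learning rate 
                # defined in the main script at the bottom of this file.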
                l_mdErrorsMB = self.UpdateMiniBatch( dEta, nMiniBatchSize, MiniBatch )
                dAvgErrorMB = self.ComputeAvgErrorForMiniBatch( l_mdErrorsMB )
                dTotalErrorEpoch = dTotalErrorEpoch + dAvgErrorMB 

            # Average over the number of mini-batches in this epoch
            dAvgErrorEpoch = dTotalErrorEpoch / len( MiniBatches )  
            adErrors[ e ] = dAvgErrorEpoch
            # Write to file
            sLine = "%E\n" % (dAvgErrorEpoch)
            foErrors.write( sLine )

            # self.Evaluate( TestData )
            print( "Epoch %d: Avg. Error: %lf" % (e, dAvgErrorEpoch) )

        foErrors.close()
        
        adx = np.arange( len(adErrors) )
        plt.plot( adx, adErrors )
        plt.xlabel( "Epochs" )
        plt.ylabel( "Total Error at the OP Layer\n(Epoch Avg. of Mini-Batch Avgs.)" )
        plt.title( "Training Errors" )
        plt.show()

    ############################################################################
    # Helper function
    def ComputeAvgErrorForMiniBatch( self, l_mdErrorsMB ):
        nSize = len( l_mdErrorsMB )
        dTotalErrorMB = 0 
        # For each training case in this mini-batch
        for i in range( nSize ):
            # Get the error vector for the i-th training case
            mdE = l_mdErrorsMB[ i ]
            dTotalErrorCase = mdE.sum()
            dTotalErrorMB = dTotalErrorMB + dTotalErrorCase
        # The average of the total errors for all cases in the current mini-batch
        dAvgErrorMB = dTotalErrorMB / nSize
        return dAvgErrorMB

    ############################################################################
    # This function is mostly useless in this simple a scenario; it predicts
    # 100 % accuracy right from 1st epoch! Obvious. With the kind of input 
    # data we generate, the prediction would have been 100 % accurate even
    # with a single function call to softmax! 
    #   
    # def Evaluate( self, TestData ):
    #     nTestCases = len( TestData )
    #     anPredictedRes = []
    #     for x, y in TestData:
    #         l_mdAllZs, l_mdAllAs = self.FeedForward( x )
            
    #         mdActivOP = l_mdAllAs[ -1 ] 
    #         nPredIdx = np.argmax( mdActivOP )
    #         nTargetIdx = np.argmax( y )
    #         anPredictedRes.append( int( nPredIdx == nTargetIdx ) )
    #     print( anPredictedRes )
    #     pass
################################################################################
# Class Network ends here
################################################################################


################################################################################
# main script
################################################################################

# Prepare Input Data

# The learning rate
dEta = 0.5 

# Number of Neurons contained In each successive Layer
anNIL = [10,10,10]

# Size of the input layer
nNeursInInput = anNIL[ 0 ]
# Size of the target layer
nTargets = anNIL[ -1 ]
# Total number of cases to have for training
nTotalNumCasesForTraining = 10000
# Total number of cases to have for testing
nTotalNumCasesForTesting = nTotalNumCasesForTraining // 5

# Number of cases to generate for each target for training
nCasesPerTargetForTraining = nTotalNumCasesForTraining // nTargets
# Number of cases to generate for each target for testing
nCasesPerTargetForTesting = nTotalNumCasesForTesting // nTargets


# For repeatability, seed the RNG in NumPy. For "true" random-ness
# at each program invocation, comment out the next line. Notice that 
# seeding, if any, must occur before generating data.   
np.random.seed( 0 )

TrainingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTraining )
TestingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTesting )

# Number of Epochs
nEpochs = 10
# Mini-Batch Size
nMiniBatchSize = 10

# Instantiate the network
# Optionally, we can have a re-seeding also here
# np.random.seed( 10000 )
nn = Network( anNIL )
# Let the network find the most optimum combination of values 
# for all the weights and biases it contains.
nn.SGD( "EpochErrors.txt", nEpochs, TrainingData, nMiniBatchSize, TestingData )

print( "Done!" )


Things for you to add and/or to try:

  • Add the code to persist the network (weights and biases) to plain-text CSV file(s), and to re-init it from them. (A minimal sketch appears right after this list.)
  • Add the code to show a simple (real-time) diagrammatic visualization of the training process for a tiny (say [4,3,2]) network, using colormap-based shades for the lines representing the synaptic weights, and similarly colormap-based circle fills (for the neuron backgrounds) representing the biases.
  • Write a more ambitious code for generating input data in n dimensions. It would have `m’ islands of volumes (of the simplest topology) of randomly varying dark shades, embedded into a randomly varying background of light shades. That is, it would be an n-dimensional generalization of taking the Windows Paint (or a similar) program, filling the canvas with a nonuniform but light-gray color, and randomly placing dark-shaded blobs of arbitrary shapes in it, sort of like the micro-structure of a two-phase material. You have to take a configuration of such blobs and introduce random but small variations into its shape and position too. Then, take this small group of variations and assign a target value to it. Repeat for N number of targets.
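
In case it helps, here is a minimal sketch for the first item, assuming the Network class from the listing above. The helper names (SaveNetwork, LoadNetwork) and the file-naming scheme are my own, introduced only for illustration; treat it as a starting point, not as the definitive way.

# Persist a trained Network to plain-text CSV files, and re-init it from them.
# One file per layer each for the biases and the weights, because the matrices
# have different shapes from layer to layer.
import numpy as np

def SaveNetwork( nn, sPrefix ):
    # Save the layer sizes first, so that LoadNetwork can rebuild the shapes
    np.savetxt( sPrefix + "_layers.csv", nn.m_anNIL, fmt="%d", delimiter="," )
    for i, ( mdB, mdW ) in enumerate( zip( nn.m_amdAllBs, nn.m_amdAllWs ) ):
        np.savetxt( "%s_B_%d.csv" % (sPrefix, i), mdB, delimiter="," )
        np.savetxt( "%s_W_%d.csv" % (sPrefix, i), mdW, delimiter="," )

def LoadNetwork( sPrefix ):
    anNIL = np.loadtxt( sPrefix + "_layers.csv", dtype=int, delimiter="," )
    nn = Network( list( anNIL ) )
    for i in range( len( anNIL ) - 1 ):
        # reshape() restores the (N,1) and (N,M) shapes which loadtxt may flatten
        nn.m_amdAllBs[ i ] = np.loadtxt( "%s_B_%d.csv" % (sPrefix, i), 
                                         delimiter="," ).reshape( -1, 1 )
        nn.m_amdAllWs[ i ] = np.loadtxt( "%s_W_%d.csv" % (sPrefix, i), 
                                         delimiter="," ).reshape( anNIL[ i+1 ], anNIL[ i ] )
    return nn

# Usage, after training:
#   SaveNetwork( nn, "MyANN" )
#   nn2 = LoadNetwork( "MyANN" )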

 


No songs section this time. Will be back on 31st/1st, as promised (see the post just below).

A song I like:

[Realizing that you wouldn’t then be in a mood to listen to (and be able to hum) a neat, melodious, golden oldie, I may as well run it right away. … Yes, I will come back with my new year’s resolutions on the 31st/1st (see the post just below), but it’s likely that I won’t run any songs section at that time.]

(Hindi) “dhaDakne lage dil ke taron ki duniya…”
Singers: Mahendra Kapoor, Asha Bhosale
Music: N. Dutta
Lyrics: Sahir Ludhianvi

….Done. Take care and bye for now. See you on the 31st, with the new post…


History:
First published: 2018.12.27 23:29 IST
Updated (also the code): 2018.12.28 09.08 IST

 

A general update

Hmmm… Slightly more than 3 weeks since I posted anything here. A couple of things happened in the meanwhile.


1. Wrapping up of writing QM scripts:

First, I wrapped up my simulations of QM. I had reached a stage (just in my mind, neither on paper nor on laptop) whereby the next thing to implement would have been: the simplest simulations using my new approach. … Ummm… I am jumping ahead of myself.

OK, to go back a bit. The way things happened, I had just about begun pursuing Data Science when this QM thingie (conference) suddenly came up. So, I had to abandon Data Science as is, and turn my attention full-time to QM. I wrote the abstract, sent it to the conference, and started jotting down some of the early points for the eventual paper. Frequent consultations with text-books were a part of it, and so was searching for any relevant research papers. Then, I also began doing simulations of the simplest textbook cases, just to see if I could find any simpler route from the standard / mainstream QM to my re-telling of the facts covered by it.

Then, as things turned out, my abstract for the conference paper got rejected. However, now that I had gotten into a tempo for writing and running the simulations, I decided to complete at least those standard UG textbook cases before wrapping up this entire activity, and going back to Data Science. My last post was written when I was in the middle of this activity.

While thus pursuing the standard cases of textbook QM (see my last post), I also browsed a lot, thought a lot, and eventually found that simulations involving my approach shouldn’t take as long as a year, not even several months (as I had mentioned in my last post). What happened here was that during the aforementioned activity, I ended up figuring out a far simpler way that should still illustrate certain key ideas from my new approach.

So, the situation, say in the first week of December, was the following: (i) Because the proposed paper had been rejected, there was no urgency for me to continue working on the QM front. (ii) I had anyway found a simpler way to simulate my new approach, and the revised estimates were that even while working part-time, I should be able to finish the whole thing (the simulations and the paper) over just a few months’ period, say next year. (iii) At the same time, studies of Data Science had anyway been kept on the back-burner.

That’s how (and why) I came to wrap up all my activity on the QM front, first thing.

I then took a little break. I then turned back to Data Science.


2. Back to Data Science:

As far as learning Data Science goes, I knew from my past experience that books bearing titles such as: “Learn Artificial Intelligence in 3 Days,” or “Mastering Machine Learning in 24 Hours,” if available, would have been very deeply satisfying, even gratifying.

However, to my dismay, I found that no such titles exist. … Or, maybe, such books are there, but someone at Google is deliberately suppressing the links to them. Whatever the case, forget becoming a Guru in 24 hours (or even in 3 days); I found that no one was promising me that I could master even just one ML library (say TensorFlow, or at least scikit-learn) over even a longer period, say about a week’s time or so.

Sure there were certain other books—you know, books which had blurbs and reader-reviews which were remarkably similar to what goes with those mastering-within-24-hours sort of books. However, these books had less appealing titles. I browsed through a few of these, and found that there simply was no way out; I would have to begin with Michael Nielsen’s book [^].

Which I did.

Come to think of it, the first time I had begun with Nielsen’s book was way back, in 2016. At that time, I had not gone beyond the first couple of sections of the first chapter or so. I certainly had not come to even going through the first code snippet that Nielsen gives, let alone running it, or trying any variations on it.

This time around, though, I decided to stick it out with this book. I had to. … What was the end result?

Well, unlike me, I didn’t take any jumps while going through this particular book. I began reading it in the given sequence, and then found that I could even continue with the same (i.e., reading in sequence)! I also made some furious underlines, margin-notes, end-notes, and all that. (That’s right. I was not reading this book online; I had first taken a printout.) I also sketched a few data structures in the margins, notably for the code around the “w” matrices. (I tend to suspect everyone else’s data structures except for mine!) I pursued this activity covering just about everything in the book, except for the last chapter. It was at this point that my patience finally broke down. I went back to my usual self and began jumping back and forth over the topics.

As a result, I can’t say that I have finished the book. But yes, I think I’ve got a fairly good idea of what’s there in it.

So there.


3. What books to read after Nielsen’s?

Of course, Nielsen’s book wasn’t the only thing that I pursued over the past couple of weeks. I also very rapidly browsed through some other books, checked out the tutorial sites on libraries like scikit-learn, TensorFlow, etc. I came to figure out two things:

As the first thing, I found that I was unnecessarily getting tense when I saw young people casually toss around some fearsome words like “recurrent learning,” “convolutional networks,” “sentiment analysis,” etc., all with such ease and confidence. Not just on the ‘net but also in real life. … I came to see them do that when I attended a function for the final-rounds presentations at Intel’s national-level competition (which was held in IISER Pune, a couple of months ago or so). Since I had seen those quoted words (like “recurrent learning”) only while browsing through text-books or Wiki articles, I had actually come to feel a bit nervous at that event. Ditto, when I went through the Quora answers. Young people everywhere in the world seemed to have put in a lot of hard work into studying Data Science. “When am I going to catch up with them, if ever?” I had thought.

It was only now, after going through the documentation and tutorials for these code libraries (like scikit-learn) that I came to realize that the most likely scenario here was that most of these kids were simply talking after trying out a few ready-made tutorials or so. … Why, one of the prize-winning (or at least short-listed) presentations at that Intel competition was about particle-swarm optimization, and during their talk, the students had even shown a neat visualization of how this algorithm works when there are many local minima. I had got impressed a lot by that presentation. … Now I gathered that it was just a ready-made animated GIF lifted from KDNuggets or some other, similar, site… (Well, as it turns out, it must have been from the Wiki! [^])

As the second thing, I realized that for those topics which Nielsen doesn’t cover, good introductory books are hard to find. (That was a bit of an understatement. My real feel here is that, we are lucky that Nielsen’s book is at all available in the first place!)

…If you have any tips on a good book after Nielsen’s then please drop me an email or a comment; thanks in advance.


4. A tentative plan:

Anyway, as of now, a good plan seems to be: (i) first, to complete the first pass through Nielsen’s book (which should take just about a couple of days or so), and then, to begin pursuing all of the following, more or less completely simultaneously: (ii) locating and going through the best introductory books / tutorials on other topics in ML (like PCA, k-means, etc); (iii) running tutorials of ML libraries (like scikit-learn and TensorFlow); (iv) typing out LaTeX notes for Nielsen’s book (which would be useful eventually for such things as hyper-parameter tuning), and running modified (i.e., simplified) versions of his code (which means, the second pass through his book); and finally (v) begin cultivating some pet project from Data Science for moonlighting over a long period of time (just the way I have maintained a long-running interest in the micro-level water-resources engineering).

As to the topic for the pet project, here are the contenders as of today. I have not finalized anything just as yet (and am likely not to do so for quite some time), but the following seem to be attractive: (a) Predicting rainfall in India (though getting granular enough data is going to be a challenge), (b) Predicting earth-quakes (locations and/or intensities), (c) Identifying the Indian classical “raaga” of popular songs, etc. … I also have some other ideas but these are more in the nature of professional interests (especially, for application in engineering industries). … Once again, if you feel there is some neat idea that could be adopted for the pet project, then sure point it out to me. …


…Anyway, that’s about it! Time to sign off. Will come back next year—or if some code / notes get written before that, then even earlier, but no definite promises.

So, until then, happy Christmas, and happy new year!…


A song I like:

(Marathi) “mee maaze mohita…”
Lyrics: Sant Dnyaaneshwar
Music and Singer: Kishori Amonkar


[One editing pass is still due; should be effected within a day or two. Done on 2018.12.18 13:41 hrs IST.]

Some running thoughts on ANNs and AI—1

Go, see if you want to have fun with the attached write-up on ANNs [^] (but please also note the version time carefully—the write-up could change without any separate announcement).

The write-up is more in the nature of a very informal blabber of the kind that goes on when people work out something on a research blackboard (or while mentioning something about their research to friends, or during a brain-storming session, or while jotting things on the back of an envelope, or something similar).

 


A “song” I don’t like:

(Marathi) “aawaaj waaDaw DJ…”
“Credits”: Go, figure [^]. E.g., here [^]. Yes, the video too is (very strongly) recommended.


Update on 05 October 2018 10:31 IST:

Psychic attack on 05 October 2018 at around 00:40 IST (i.e. the night between 4th and 5th October, IST).

 

Caste Brahmins, classification, and ANN

1. Caste Brahmins:

First, a clarification: No, I was not born in any one of the Brahmin castes, particularly, not at all in the Konkanastha Brahmins’ caste.

Second, a suggestion: Check out how many caste-Brahmins have made it to the top in the Indian and American IT industry, and what sort of money they have made—already.

No, really.

If you at all bother visiting this blog, then I do want you to take a very serious note of both these matters.

No. You don’t have to visit this blog. But, yes, if you are going to visit this blog, to repeat, I do want you to take matters like these seriously.

Some time ago, perhaps a year ago or so, a certain caste-Brahmin in Pune from some place (but he didn’t reveal his shakha, sub-caste, gotra, pravar, etc.) had insulted me, while maintaining a perfectly cool demeanor for himself, saying how he had made so much more money than me. Point taken.

But my other caste-Brahmin “friends” kept quiet at that time; not a single soul from them interjected.

In my off-the-cuff replies, I didn’t raise this point (viz., why these other caste-Brahmins were keeping quiet), but I am sure that if I were to do that, then, their typical refrain would have been (Marathi) “tu kaa chiDatos evhaDa, to tar majene bolat hotaa.” … English translation: Why do you get so angry? He was just joking.

Note the usual caste-Brahmin trick: they skillfully insert an unjustified premise; here, that you are angry!

To be blind to the actual emotional states or reactions of the next person, if he comes from some other caste, is a caste-habit with the caste-Brahmins. The whole IT industry is full of them—whether here in India, or there in USA/UK/elsewhere.

And then, today, another Brahmin—a Konkanastha—insulted me. Knowing that I am single, he asked me if I for today had taken the charge of the kitchen, and then, proceeded to invite my father to a Ganesh Pooja—with all the outward signs of respect being duly shown to my father.


Well, coming back to the point which was really taken:

Why have caste-Brahmins made so much money—to the point that they in one generation have begun very casually insulting the “other” people, including people of my achievements?

Or has it been the case that the people of the Brahmin castes always were this third-class, in terms of their culturally induced convictions, but that we did not come to know of it from our childhood, because the elderly people around us kept such matters, such motivations, hidden from us? Maybe in the naive hope that we would thereby not get influenced in a bad manner? Possible.

And, of course, how come these caste-Brahmins have managed to attract as much money as they did (salaries in excess of Rs. 50 lakhs being averagely normal in Pune) even as I was consigned only to receive “attract” psychic attacks (mainly from abroad) and insults (mainly from those from this land) during the same time period?

Despite all my achievements?

Do take matters like these seriously, but, of course, as you must have gathered by now, that is not the only thing I would have, to talk about. And, the title of this post anyway makes this part amply clear.


2. The classification problem and the ANNs:

I have begun my studies of the artificial neural networks (ANNs for short). I have rapidly browsed through a lot of introductory articles (as also the beginning chapters of books) on the topic. (Yes, including those written by Indians who were born in the Brahmin castes.) I might have gone through 10+ such introductions. Many of these, I had browsed through a few years ago (I mean only the introductory parts). But this time round, of course, I picked them up for a more careful consideration.

And soon enough (i.e. over just the last 2–3 days), I realized that no one in the field (AI/ML) seemed to offer a good explanation for this question:

Why is it that the ANN really succeeds as well as it does, when it comes to the classification tasks, but not others?

If you are not familiar with Data Science, then let me note that it is known that ANN does not do well on all the AI tasks. It does well only on one kind of them, viz., the classification tasks. … Any time you mention the more general term Artificial Intelligence, the layman is likely to think of the ANN diagram. However, ANNs are just one type of a tool that the Data Scientist may use.

But the question here is this: why does the ANN do so well on these tasks?

I formulated this question, and then found an answer too, and I would sure like to share it with you (whether the answer I found is correct or not). However, before sharing my answer, I want you to give it a try.

It would be OK by me if you answer this question in reference to just one or two concrete classification tasks—whichever you find convenient. For instance, if you pick up OCR (optical character recognition, e.g., as explained in Michael Nielsen’s free online book [^]), then you have to explain why an ANN-based OCR algorithm works in classifying those MNIST digits / alphabets.


Hint: Studies of Vedic literature won’t help. [I should know!] OTOH, studies of good books on epistemology, or even just good accounts covering methods of science, should certainly come in handy.

I will give you all some time before I come back on that question.

In the meanwhile, have fun—if you wish to, and of course, if you are able to. With questions of this kind. (Translating the emphasis in the italics into chaste Marathi: “laayaki asali tar.” Got it?)


A song I like:
(Marathi) “ooncha nicha kaahi neNe bhagawant”
Lyrics: Sant Tukaram
Music and Singer: Snehal Bhatkar

 

Hushshshsh…

The title word of this post is Marathi for “phew!”—not for “hush hush.” (But, to me, the Marathi word is more expressive. [BTW, the “hu” is to be pronounced as “hoo”.])

The reason for this somewhat accentuated and prolonged exhalation is this: I am done with the version “0.1-beta” of my note on flux (see my last post). I need a break.

As of now, this note is about 27 pages, and with figures and some further (< 5%) additions, the final number of pages for the version 0.1 should easily go into the early 30s. … To be readable, it will have to be brought down to about 15 pages or fewer, including diagrams. Preferably, < 10 pages. Some other day. For now, I find that I have grown plain sick and tired of working on that topic. I need to get away from it all. I am sure that I will return to it later on—maybe after a month or so. But for the time being, I simply need a break—from it. I’ve had enough of this flux and vectors and tensors thing. … Which brings me to the next topic of this post.

There are also other announcements.


I think that I have also had enough of QM.

QM was interesting, very interesting, to me. It remained that way for some four decades. But now that I have cracked it (to my satisfaction), my interest in the topic has begun dwindling down very rapidly.

Sure I will conduct a few simulations, deliver the proposed seminar(s) I had mentioned in the past, and also write a paper or two about it. But I anticipate that I won’t go much farther than what I have already understood. The topic, now, simply has ceased to remain all that interesting.

But, yes, I have addressed all the essential QM riddles. That’s for certain.


And then, I was taking a stock of my current situation, and here are a few things that stood out:

  • I am not getting an academic job in Pune (because of the stupid / evil SPPU rules), and frankly, the time when a (full) Professor’s job could have meant something to me is already over. If I had gotten such a job well in time—which means years ago—then I could have done some engineering research (especially in CFD), guided a few students (in general in computational science and engineering), taught courses, developed notes, etc. But after having lost a decade or so due to that stupid and/or evil Metallurgy-vs-Mechanical Branch Jumping issue, I don’t have the time to pursue all of that sort of thing any more.
  • You would know this: All my savings are over; I am already in debts.
  • I do not wish to get into a typical IT job. It could be well paying, but it involves absolutely no creativity and originality—I mean creativity involving theoretical aspects. Deep down in my heart, I remain a “theoretician”—and a programmer. But not a manager. There is some scope for creativity in the Indian IT industry, but at my “seniority,” it is mostly limited, in one way or the other, only to “herd-management” (to use an expression I have often heard from my friends in the industry). And, I am least bothered about that. So, to say that by entering the typical Indian IT job, my best skills (even “gifts”) would go under-utilized, would be an understatement.
  • For someone like me, there is no more scope, in Pune, in the CFD field either. Consultants and others are already well established. I could keep my eyes and ears open. But it looks dicey to rely on this option. The best period for launching industrial careers in CFD here was, say, up to the early noughties. … I still could continue with some research in CFD. But I guess it no longer is a viable career option for me. Not in Pune.
  • Etc.

However, all is not gloomy. Not at all. Au contraire.

I am excited that I am now entering a new field.

I will not ask you to take a guess. This career route for people with my background and skills is so well-established by now, that there aren’t any more surprises left in it. Even an ANN would be able to guess it right.

Yes, that’s right. From now on, I am going to pursue Data Science.


This field—Data Science—has a lot of attractive features, as far as I am concerned. The way I see it, the following two stand out:

  1. There is a very wide variety of application contexts; and
  2. There is a fairly wide range of mathematical skills that you have to bring to bear on these problems.

Notice, the emphasis is on the width, not on the depth.

The above-mentioned two features, in turn, lead to or help explain many other features, like:

  1. A certain open ended-ness of solutions—pretty much like what you have in engineering research and design. In particular, one size doesn’t fit all.
  2. A relatively higher premium on the individual thinking skills—unlike what your run-of-the-mill BE in CS does, these days [^].

Yes, Data Science, as a field, will come to mature, too. The first sign that it is reaching the first stage of maturity would be an appearance of a book like “Design Patterns.”

However, even this first stage is, I anticipate, distant in future. All in all, I anticipate that the field will not come to mature before some 7–10 years pass by. And that’s long enough a time for me to find some steady career option in the meanwhile.


There are also some other plus points that this field holds from my point of view.

I have coded extensively—more than 1 lakh (100,000) lines of C++ code in all, before I came to stop using C++, which happened especially after I entered academia. I am already well familiar with Python and its eco-system, though am not an expert when it comes to writing the fastest possible numerical code in Python.

I have handled a great variety of maths. The list of equations mentioned in my recent post [^] is not even nearly exhaustive. (For instance, it does not even mention whole topics like probability and statistics, stereology, and many such matters.) When it comes to Data Science, a prior experience with a wide variety of maths is a plus point.

I have not directly worked with topics like artificial neural networks, deep learning, the more advanced regression analysis, etc.

However, realize that for someone like me, i.e., someone who taught FEM, and had thought of accelerating solvers via stochastic means, the topic of constrained optimization would not be an entirely unknown animal. Some acquaintance has already been made with the conjugate gradient (though I can’t claim mastery of it). Martingales theory—the basic idea—is not a complete unknown. (I had mentioned a basic comparison of my approach vis-a-vis the simplest or the most basic martingales theory, in my PhD thesis.)

Other minor points are these. This field also (i) involves visualization abilities, (ii) encourages good model building at the right level of abstraction, and (iii) places premium on presentation. I am not sure if I am good on the third count, but I sure am confident that I do pretty well on the first two. The roots of all my new research ideas, in fact, can be traced back to having to understand physical principles in 3+1 D settings.


Conclusion 1: I should be very much comfortable with Data Science. (Not sure if Data Science itself (i.e., Data Scientists themselves) would be comfortable with me or not. But that’s something I could deal later on.)

Conclusion 2: Expect blogging here going towards Data Science in the future.


A Song I Like:

(Hindi) “uff, teri adaa…”
Music: Shankar-Ehsaan-Loy
Singer: Shankar Mahadevan
Lyrics: Javed Akhtar

[By any chance, was this tune at least inspired (if not plagiarized) from some Western song? Or is it through and through original? …In any case, I like it a lot. I find it wonderful. It’s upbeat, but not at all banging on the ears. (For contrast, attend any Ganapati procession, whether on the first day, the last day, or any other day in between. You will have ample opportunities to know what having your ears banged out to deafness means. Nay, these opportunities will be thrust upon you, whether you like it or not. It’s our “culture.”)]

 

 

Links…

Here are a few interesting links I browsed recently, listed in no particular order:


“Mathematicians Tame Turbulence in Flattened Fluids” [^].

The operative word here, of course, is: “flattened.” But even then, it’s an interesting read. Another thing: though the essay is pop-sci, the author gives the Navier-Stokes equations, complete with fairly OK explanatory remarks about each term in the equation.

(But I don’t understand why every pop-sci write-up gives the NS equations only in the Lagrangian form, never Eulerian.)
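
For reference, here is the incompressible momentum balance written both ways (a standard statement, jotted down from memory; \( \mathbf{b} \) denotes the body force per unit volume):

\[ \rho \, \frac{D \mathbf{u}}{D t} \;=\; -\nabla p \;+\; \mu \nabla^{2} \mathbf{u} \;+\; \mathbf{b} \]

\[ \rho \left( \frac{\partial \mathbf{u}}{\partial t} \;+\; \left( \mathbf{u} \cdot \nabla \right) \mathbf{u} \right) \;=\; -\nabla p \;+\; \mu \nabla^{2} \mathbf{u} \;+\; \mathbf{b} \]

The first form packs the acceleration into the material derivative, i.e., the rate of change following a fluid particle (the “Lagrangian-flavoured” book-keeping); the second expands it into the local and the convective terms at a fixed spatial point (the “Eulerian-flavoured” book-keeping). The physics is the same either way; only the book-keeping of the acceleration term differs.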


“A Twisted Path to Equation-Free Prediction” [^]. …

“Empirical dynamic modeling.” Hmmm….


“Machine Learning’s `Amazing’ Ability to Predict Chaos” [^].

Click-bait: They use data science ideas to predict chaos!

8 Lyapunov times is impressive. But ignore the other, usual kind of hype: “…the computer tunes its own formulas in response to data until the formulas replicate the system’s dynamics. ” [italics added.]


“Your Simple (Yes, Simple) Guide to Quantum Entanglement” [^].

Click-bait: “Entanglement is often regarded as a uniquely quantum-mechanical phenomenon, but it is not. In fact, it is enlightening, though somewhat unconventional, to consider a simple non-quantum (or “classical”) version of entanglement first. This enables us to pry the subtlety of entanglement itself apart from the general oddity of quantum theory.”

Don’t dismiss the description in the essay as being too simplistic; the author is Frank Wilczek.


“A theoretical physics FAQ” [^].

Click-bait: Check your answers with those given by an expert! … Do spend some time here…


Tensor product versus Cartesian product.

If you are an engineer and you get interested in quantum entanglement, beware of two easily confused terms: the tensor product and the Cartesian product.

The tensor product, you might think, is like the Cartesian product. But it is not. See mathematicians’ explanations. Essentially, the basis sets (and the operations) are different. [^] [^].

But what the mathematicians don’t do is to take some simple but non-trivial examples and actually work everything out in detail. Instead, they just jump from this definition to that definition. For example, see: “How to conquer tensorphobia” [^] and “Tensorphobia and the outer product” [^]. Read either of these last two articles; any one of them is sufficient to give you tensorphobia even if you never had it!

You will never run into a mathematician who explains the difference between the two concepts by first giving you a rough feel for it: by giving you a good, worked-out example in the context of finite sets (including an enumeration of all the set elements) that illustrates the key difference, i.e., the addition vs. the multiplication of the counts of the unit vectors (aka the members of the basis sets).
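
Just to make that concrete for myself, here is a rough numpy sketch (the example and all the names in it are mine; np.kron merely stands in for the tensor product of finite-dimensional coordinate vectors):

# Dimensional book-keeping for two small real vector spaces, U (2D) and V (3D).
import numpy as np

e1, e2 = np.eye( 2 )          # basis of U
f1, f2, f3 = np.eye( 3 )      # basis of V

# Cartesian-product-like pairing (the direct sum): the basis members simply get
# appended side by side, so the dimensions ADD: 2 + 3 = 5.
l_DirectSumBasis = [ np.concatenate( ( e, np.zeros( 3 ) ) ) for e in ( e1, e2 ) ] + \
                   [ np.concatenate( ( np.zeros( 2 ), f ) ) for f in ( f1, f2, f3 ) ]
print( len( l_DirectSumBasis ) )   # prints 5

# Tensor product: every basis member of U gets multiplied with every basis
# member of V, so the dimensions MULTIPLY: 2 x 3 = 6.
l_TensorBasis = [ np.kron( e, f ) for e in ( e1, e2 ) for f in ( f1, f2, f3 ) ]
print( len( l_TensorBasis ) )      # prints 6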

A third-class epistemology when it comes to explaining, mathematicians typically have.


A Song I Like:

(Marathi) “he gard niLe megha…”
Singers: Shailendra Singh, Anuradha Paudwal
Music: Rushiraj
Lyrics: Muralidhar Gode

[As usual, a little streamlining may occur later on.]