Learnability of machine learning is provably undecidable?—part 2

In this post, we look into the differences between the idea of sets and that of concepts. The discussion here is exploratory, and hence not very well isolated; there are overlaps of points between sections. Indeed, there are going to be overlaps of points from post to post too! The idea behind this series of posts is not to present a long-thought-out and matured point of view; it is much more in the nature of jotting down salient points and trying to bring some initial structure to them. Thus, the writing in this series is, really speaking, just a step beyond the brain-storming stage.

There is no direct discussion in this post regarding the learnability issue at all. However, the points we note here are crucial to understanding Godel’s incompleteness theorem, and in that sense, the contents of this post are crucially important in framing the learnability issue right.

Anyway, let’s get going over the differences between sets and concepts.


A concept as an abstract unit of mental integration:

Concepts are mental abstractions. It is true that concepts, once formed, can themselves be regarded as mental units, and qua units, they can further be integrated together into even higher-level concepts, or possibly sub-divided into narrower concepts. However, regardless of the level of abstraction at which a given concept exists, the concretes being subsumed under it are necessarily required to be less abstract than the single mental unit that is the concept itself.

Using the terms of computer science, the “graph” of a concept and its associated concrete units is not only acyclic and directed (the edges running from the concretes to the higher-level mental abstraction that is the concept); its connections, too, can be drawn if and only if the concretes satisfy the rules of conceptual commensurability.

A concept is necessarily a mental abstraction, and as a unit of mental integration, it always exists at a higher level of abstraction as compared to the units it subsumes.


A set as a mathematical object that is just a concrete collection:

Sets, on the other hand, necessarily are just concrete objects in themselves, even if they do represent collections of other concrete objects. Sets take birth as concrete objects—i.e., as objects that don’t have to represent any act of mental isolation and integration—and they remain that way till the end of their life.

For the same reason, set theory carries absolutely no rules whereby constraints could be placed on how sets may be combined. No meaning is supposed to be assigned to the very act of placing braces around the rule which defines the admissibility of objects as members of a set (or around the enumeration of its member objects).

The act of creating the collection that is a set is formally allowed to proceed even in the absence of any preceding act of mental differentiations and integrations.

The distinction between these two ideas—the idea of a concept and that of a set—is important to grasp.


An instance of a mental abstraction vs. a membership into a concrete collection:

In the last post in this series, I had used the terminology in a particular way: I had said that there is a concept “table,” and that there is a set of “tables.” The plural form for the idea of the set was not a typo; it was a deliberate device to highlight this same significant point, viz., the essential concreteness of any set.

The mathematical theory of sets didn’t have to be designed this way, but given the way it anyway has actually been designed, one of the inevitable implications of its conception—its very design—has been this difference which exists between the ideas of concepts and sets. Since this difference is extremely important, it may be worth our while to look at it from yet another viewpoint.

When we look at a table and, having already reached the concept of “table,” we affirm that the given concrete table in front of us is indeed a table, this seemingly simple and almost instantaneously completed act of recognition itself implicitly involves a complex mental process. The process includes invoking a previously generated mental integration—an integration which was, sometime in the past, performed in reference to those attributes which actually exist in reality and which make a concrete object a table. The process begins with the availability of this context as a pre-requisite, and now involves an application of the concept: actively bringing forth the pre-existing mental integration, actively “seeing” that yet another concrete instance of a table does indeed in reality carry the attributes which make an object a table, and thereby concluding that it is a table.

In other words, if you put the concept table symbolically as:

table = { this table X, that table Y, now yet another table Z, … etc. }

then it is understood that what the symbol on the left-hand side stands for is a mental integration, and that each of the concrete entities X, Y, Z, etc. appearing in the list on the right-hand side is, by itself, an instance corresponding to that unit of mental integration.

But if you interpret the same “equation” as one standing for the set “tables,” then, strictly speaking, according to the actual formalism of the set theory itself (i.e., without bringing into the context any additional perspective which we habitually do, but sticking strictly to the formalism), each of the objects X, Y, Z, etc. remains just a concrete member of a merely concrete collection or aggregate that is the set. The mental integration which regards X, Y, Z as equally similar instances of the idea of “table” is missing altogether.

Thus, no idea of similarity (or of differences) among the members gets involved at all, because there is no mental abstraction “table” in the first place. There are only concrete tables, and there is a well-specified but concrete object, a collective, which is only formally defined to stand for this concrete collection (of those specified tables).
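A loose programming analogy may help fix the difference. (It is only an analogy I am supplying here; nothing like it exists within the set-theoretic formalism itself, and the attribute names below are entirely hypothetical.) An extensional collection merely enumerates its members; a concept-like treatment would instead carry the defining rule of which each member is an instance:

# A rough analogy in Python, not anything from the set theory itself:
# the extensional collection merely holds whatever concretes were enumerated...
tables_as_set = { "this table X", "that table Y", "yet another table Z" }

# ...whereas a concept-like treatment would carry the defining rule itself,
# so that something counts as a member only because it satisfies the rule.
def is_table( obj ):
    # hypothetical attributes, standing in for the defining characteristics
    return obj.has_flat_surface and obj.has_supports and obj.is_used_as_work_surface

The collection in the first line knows nothing about why its members belong together; the predicate after it is closer to what a concept does, because membership follows from the rule, i.e., from the (mentally isolated) defining characteristics.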

Grasp this difference, and the incompleteness paradox brought forth by Godel begins to dissolve away.


The idea of an infinite set cuts out the preceding theoretical context:

Since the aforementioned point is complex but important, there is no risk in repeating (though there could be boredom!):

There is no place-holder in the set theory which would be equivalent to saying: “being able to regard concretes as the units of an abstract, singular, mental perspective—a perspective reached in recognition of certain facts of reality.”

The way set theory progresses in this regard is indeed extreme. Here is one way to look at it.

The idea of an infinite set is altogether inconceivable before you have first grasped the concept of infinity. On the other hand, grasping the concept of infinity can be accomplished without any involvement of the set theory anyway—formally or informally. However, since every set you actually observe in concrete reality can only be finite, and since sets themselves are concrete objects, there is no way to conceive of the very idea of an infinite set unless you already know what infinity means (at least in some working, implicit sense). Thus, to generate the concrete members contained in a given infinite set, you of course need the conceptual knowledge of infinite sequences and series.

However, even if the set theory must use this theoretical apparatus of analysis, the actual mathematical object it ends up having still captures only the “concrete-collection” aspect of it—none other. In other words, the set theory drops from its very considerations some of the crucially important aspects of knowledge with which infinite sets can at all be conceived of. For instance, it drops the idea that the infinite set-generating rule is in itself an abstraction. The set theory asks you to supply and use that rule; the theory itself is content merely with being supplied some well-defined entities as the members of a set.
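To put the same point in a rough programming analogy (mine, not the set theory’s): the generating rule is a piece of conceptual knowledge covering an infinity of members, while any collection you can actually materialize from it remains finite and concrete. A minimal sketch:

# A rough analogy, not part of the set theory itself: the generating rule
# for the even numbers is an abstraction covering an infinity of concretes...
def even_numbers():
    n = 0
    while True:        # the rule has no last member
        yield n
        n += 2

# ...but any collection you can actually materialize from it is finite and concrete:
some_evens = set()
gen = even_numbers()
for _ in range( 5 ):
    some_evens.add( next( gen ) )
print( some_evens )    # the five members generated: 0, 2, 4, 6, 8

The theory happily accepts whatever members such a rule supplies; the rule itself, i.e., the abstraction, has no formal standing within the theory.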

It is at places like this that the infamous incompleteness creeps into the theory—I mean, the theory of sets, not the theory that is the analysis as was historically formulated and practiced.


The name of a set vs. the word that stands for a concept:

The name given to a set (the symbol or label appearing on the left-hand side of the equation) is just an arbitrary and concrete label; it is not a theoretical place-holder for the corresponding mental concept—not so long as you remain strictly within the formalism, and therefore the scope of application, of the set theory.

When they introduce you to the set theory in your high school, they take care to choose each of the examples in only such a way that there always is an easy-to-invoke and well-defined concept; this pre-existing concept can then be put into a 1:1 correspondence with the definition of that particular set.

But if you therefore begin thinking that there is a well-defined concept for each possible instance of a set, then such a characterization is only a figment of your own imagination. An idea like this is certainly not to be found in the actual formalism of the set theory.

Show me the place in the axioms, or their combinations, or theorems, or even just lemmas or definitions in the set theory where they say that the label for a set, or the rule for formation of a set, must always stand for a conceptually coherent mental integration. Such an idea is simply absent from the mathematical theory.

The designers of the set theory, to put it directly, simply didn’t have the wits to include such ideas in their theory.


Implications for the allowed operations:

The reason why the set theory allows for any arbitrary operands (including those which don’t make any sense in the real world) is, thus, not an accident. It is a direct consequence of the fact that sets are, by design, concrete aggregates, not mental integrations based on certain rules of cognition (which in turn must make a reference to the actual characteristics and attributes possessed by the actually existing objects).

Since sets are mere aggregations, not integrations, as a consequence, we no longer remain concerned with the fact that there have to be two or more common characteristics to the concrete objects being put together, or with the problem of having to pick up the most fundamental one among them.

When it comes to sets, there are no such constraints on the further manipulations. Thus arises the possibility of being able to apply any operator, any which way you feel like, on any given set.


Godel’s incompleteness theorem as merely a consequence:

Given such a nature of the set theory—its glaring epistemological flaws—something like Kurt Godel’s incompleteness theorem had to arrive on the scene, sooner or later. The theorem succeeds only because the set theory (on which it is based) does give it what it needs—viz., the loss of the connection between a word (a set label) and how it is meant to be used (the contexts in which it can be further used, and how).


In the next part, we will reiterate some of these points by looking at the issue of (i) systems of axioms based on the set theory on the one hand, and (ii) the actual conceptual body of knowledge that is arithmetic, on the other hand. We will recast the discussion so far in terms of the “is a” vs. the “has a” types of relationships. The “is a” relationship may be described as the “is an instance of a mental integration or concept of” relationship. The “has a” relationship may be described as “is (somehow) defined (in whatever way) to carry the given concrete” type of a relationship. If you are curious, here is the preview: concepts allow for both types of relationships to exist; however, for defining a concept, the “is an instance or unit of” relationship is crucially important. In contrast, the set theory requires and has the formal place for only the “has a” type of relationships. A necessary outcome is that each set itself must remain only a concrete collection.
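For programmers, a rough way to anticipate this distinction ahead of the next part (again, only an analogy of my own, with hypothetical class names) is the familiar contrast between inheritance and composition:

# A rough programming analogy, for illustration only:
class Furniture:
    pass

class Table( Furniture ):          # "is a": Table is an instance/kind of the wider integration
    pass

class Room:
    def __init__( self ):
        self.table = Table()       # "has a": Room merely carries a concrete as a part

Concept-formation needs the first kind of relationship; the set theory, as argued above, formally provides only for the second.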

 


Learnability of machine learning is provably undecidable?—part 1

This one news story has been lying around for about a week on my Desktop:

Lev Reyzin, “Unprovability comes to machine learning,” Nature, vol. 565, pp. 166–167, 10 January 2019 [^]. PDF here: [^]

(I’ve forgotten how I came to know about it though.) The story talks about the following recent research paper:

Ben-David et al., “Learnability can be undecidable,” Nature Machine Intelligence, vol. 1, pp. 44–48, January 2019 [^]. PDF here: [^]

I don’t have the requisite background in the theory of the research paper itself, and so didn’t even try to read through it. However, I did give Reyzin’s news article a try. It was not very successful; I have not been able to finish the story yet. However, here are a few notings which I made as I tried to progress through it. The quotations here all come from Reyzin’s news story.

Before we begin, take a moment to notice that the publisher here is arguably the most reputed one in science, viz., the Nature publishing group. As to the undecidability of learnability, its apparent implications for practical machine learning, artificial intelligence, etc., are too obvious to be pointed out separately.


“During the twentieth century, discoveries in mathematical logic revolutionized our understanding of the very foundations of mathematics. In 1931, the logician Kurt Godel showed that, in any system of axioms that is expressive enough to model arithmetic, some true statements will be unprovable.”

Is it because Godel [^] assumed that any system of axioms (which is expressive enough to model arithmetic) would be based on the standard (i.e. mathematical) set theory? If so, his conclusion would not be all that paradoxical, because the standard set theory carries, from an epistemological angle, certain ill-conceived notions at its core. [BTW, throughout this (short) series of posts, I use Ayn Rand’s epistemological theory; see ITOE, 2e [^][^].]


To understand my position (that the set theory is not epistemologically sound), start with a simple concept like “table”.

According to Ayn Rand’s ITOE, the concept “table” subsumes all possible concrete instances of tables, i.e., all the tables that conceivably exist, might have ever existed, and might ever exist in future—i.e., a potentially infinite number of concrete instances of them. Ditto for any other concept, e.g., “chair.” Concepts are mental abstractions that each stand for an infinity of concretes of a given kind.

Now, let’s try to run away from philosophy, and thereby come to rest in the arms of, say, a mathematical logician like Kurt Godel [^], or preferably, his predecessors, those who designed the mathematical set theory [^].

The one (utterly obvious) way to capture the fact that there exist tables, but only using the actual terms of the set theory, is to say that there is a set called “tables,” and that its elements consist of all possible tables (i.e., all the tables that might have existed, might conceivably exist, and would ever conceivably exist in future). Thus, the notion again refers to an infinity of concretes. Put into the terms of the set theory, the set of tables is an infinite set.

OK, that seems to work. How about chairs? Once again, you set up a set, now called “chairs,” and proceed to dump within its braces every possible or conceivable chair.

So far, so good. No trouble until now.


The trouble begins when you start applying operators to the sets, say by combining them via unions, or by taking their intersections, and so on—all that Venn-diagram business [^]. But what is the trouble with the good old Venn diagrams, you ask? Well, the trouble is not so much with the Venn diagrams as with the basic set theory itself:

the set theory makes the notion of the set so broad that it allows you to combine any sets in any which way you like, and still be able to call the result a meaningful set—meaningful, as seen strictly from within the set theory.

Here is an example. You can not only combine (take the union of) “tables” and “chairs” into a broader set called “furniture”; you are also equally well allowed, by the formalism of the set theory, to absorb into the same set all unemployed but competent programmers, Indian HR managers, and venture capitalists from the San Francisco Bay Area. The set theory does not by itself have anything in its theoretical structure, formalism, or even mathematical application repertoire using which it could possibly so much as raise a finger in such matters. This is a fact. If in doubt, refer to the actual set theory ([^] and links therein), take it strictly on its own terms, and in particular, desist from mixing into it any extra interpretations brought in by you.
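If you want to see how little the formalism itself objects, even the built-in set type of an ordinary programming language (used here purely as a stand-in for the mathematical notion, with made-up member strings) will happily accept such a union:

# Python's built-in set, used only as an analogy for the mathematical set:
tables = { "table X", "table Y" }
chairs = { "chair A", "chair B" }
bay_area_vcs = { "a VC from the SF Bay Area" }

furniture = tables | chairs | bay_area_vcs   # the union goes through without complaint
print( furniture )

Nothing in the operation itself checks whether the operands are conceptually commensurate; that check, if it happens at all, happens only in your head.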

Epistemology, on the other hand, does have theoretical considerations, including explicitly formulated rules at its core, which together allow us to distinguish between proper and improper formulations of concepts. For example, there is a rule that the concrete instances being subsumed under a concept must themselves be conceptually commensurate, i.e., they must possess the same defining characteristics, even if possibly to differing degrees. Epistemology prevents the venture capitalists from the San Francisco Bay Area from being called pieces of furniture because it clarifies that they are people, whereas pieces of furniture are inanimate objects, and for this crucial reason, the two are conceptually incommensurate—they cannot be integrated together into a common concept.

To come back to the set theory: it, however, easily allows for every abstractly conceivable “combination” of every possible operand set(s). Whether the operation has any cognitive merit to it or not, whether it results in anything meaningful at all or not, is not at all a consideration—not by the design of the set theory itself (which, many people suppose, is more fundamental than every other theory).

So—and get this right—calling the collection of QC scientists politicians or scoundrels is not at all an abuse of the mathematical structure, content, and meaning of the set theory. The ability to take an intersection of the set of all mathematicians who publish papers and the set of all morons is not a bug; it is very much a basic and core feature of the set theory. There is absolutely nothing in the theory itself which says that the intersection operator cannot be applied here, or that the resulting set has to be an empty set. None.

The set theory simply neglects any consideration of the kind of label a set carries, or of the kind of elements which can be added to it.

More on this, later. (This post has already gone past 1000 words.)


The songs section will come at the conclusion of this (short) series of posts, to be completed soon enough; stay tuned…

Vodafone Idea—“ekdum phaaltu…” (20 to 60 times lower ‘net connection speeds)

Just that. (The Hindi, Marathi (and perhaps also Gujarathi) colloquial expression “ekdum phaaltu” loosely translates to: “utterly worthless.”)


These people [^][^][^] don’t deliver on what they promise.

The service they promise is: 4G. Which means, the speed is supposed to be about 6.1 Mbps because this is India, even though globally, the speed is about 17 Mbps; see a recent Economic Times story here. [^]

The maximum speed (actual, as reported by Ubuntu’s system monitor) over the last 3 days, measured at my machine, has been less than 300 kilobits per second. (Yes, that is bits, not bytes.) On average, it’s more like 100–150 kbps, because the connection is simply absent for more than half of the total connection time.

Thus, the actual speed is 20–60 times less[*] than what they say they deliver in India, and about 60–180 times less as compared to the globally available speed.

But while preparing bills, they do charge for the 4G speeds! This is one Indian rope trick!

The time taken today to log in to my wordpress blog site and to begin editing this post was approximately 10 minutes. (I began loading wordpress’ dashboard at 14:02 IST; it was done loading only by 14:12 IST!)

Bad! Pathetically bad!


Addendum on 2019.01.14 on whether the expressions “X times lower than” or “Y times less than” make for good English or not: check out here [^].

Also, realizing that it’s “Makar Sankrant” today, I have deleted the swear words which had appeared in the first couple of versions of this post. Do get in touch with me if you wish to know what these were. The characterization “ekdum phaaltu” is, however, being retained because it objectively describes the basis for the position which in turn had led to the use of the swear words.


A further addendum on 2019.01.14:

A song I find funny:

(Marathi) “bolaa, amrit bolaa…”
Singer: Jyotsna Bhole
Music: Master Krishnarao
Lyrics:  M. G. Rangnekar

[… BTW, the order of the listing of the credits doesn’t matter here, because I honestly can’t settle down on the question of who really makes it more funny. … You must listen to it in order to believe me. … But to think that this song was “avant garde” or “modern” once upon a time! …Ah!…]

 

A bit on Panpsychism—part 2: Why the idea is basically problematic, and what could be a different (and hopefully better) alternative

I continue from my last post. While the last post was fairly straightforward, the subject-matter of this post itself is such that the writing becomes meandering.


The basic trouble with panpsychism:

The primary referent of the concept of consciousness is one’s own consciousness. The existence of the same faculty in other beings is only an inference drawn from observations. If so, and in view of the two facts discussed in the last post, why can’t a similar inference be extended to everything material, too?

Well, consciousness is observed to exist only in those beings that are in fact alive. Consciousness is fundamental, sure. In Ayn Rand’s system, it even is a philosophical axiom. But qua a metaphysical existent, consciousness also happens to be only an attribute, and that too, of only one class of existents: the living beings.

Here, we will not get into the debate concerning which species can be taken to be truly conscious, i.e., which species can be said to have an individualized, conscious grasp of reality. Personally, I believe that all living beings are conscious to some extent, even if only marginally so in the more primitive species such as amoebae or plants.

However, regardless of whether plants can be taken to be conscious or not, we can always say that material entities that are not alive never show any evidence of being conscious. Your credit cards, spectacles, or T-shirts never show any evidence of being engaged in a process of grasping reality, or of having a definite, internal and individualized representation of any aspect of reality—no matter in how diluted, primitive or elementary form it may be posited to exist, or how fleetingly momentary such a grasp may be asserted to be. Consciousness is an attribute of only those beings that actually have life. You can’t tell your credit card to go have a life—it simply cannot. For the same reason, it can’t have the faculty to know anything, speaking literally.

Now, coming to the phenomenon of life, it is delimited on two different counts: (i) Life is an attribute possessed by only some beings in the universe, not all. (ii) Even those beings which are alive at some point of time must eventually die after the elapse of some finite period of time. When they do, their physical constituents are no different from the beings that never were alive in the first place. (This “forward-pass” kind of a logical flow is enough for us here; we need not look into the “backward-pass”, viz., the issue of whether life can arise out of the purely inanimate matter or not. It is a complicated question, and so, we will visit it some time later on.)

The physical constituents of a living organism continue to remain more or less the same after the event of its death. Even if we suppose that there is a permanent loss of some kind of a *physical* constituent or attribute at the time of death, for our overall argument (concerning panpsychism) to progress, it is enough to observe and accept that at least **some** of the physical aspects continue to remain the same even after death. The continued existence of at least a part of the physical constituents is sufficient to establish the following important conclusion:

Not all physical parts of the universe are at all times associated with living beings.

Given the above conclusion, it is easy to see that to speak of all parts of the reality as possessing consciousness is an elementary error: Not all parts of reality are alive at any point of time, and consciousness is an attribute of only those beings that are alive.


An aside related to reincarnation:

Even if reincarnation exists (and I do believe that it does), what persists in between two life-times is not consciousness, but only the soul.

In my view (derived from the ancient Indian traditions, of course, but also departing from them at places), the term “soul” is to be taken in the sense of an individual (Sanskrit) “aatmaa.” An “aatmaa,” in my view, is, loosely speaking, the “thing” which is neither created at birth nor destroyed at death. However, it is individual in nature, and remains in common across all the life-times of a given individual. Thus, I do not take the term “soul” in the sense in which Aristotle and Ayn Rand do. (For both Aristotle and Ayn Rand, the soul comes into being at birth, and ceases to exist at death.) Further, in my view, the soul has no consciousness—i.e., no feelings, not even desires. For more details on my view of the soul, see my earlier posts, especially these: [^][^][^].

The important point for our present discussion is this: even if the soul were to be an attribute of all parts of the entire universe (including every inanimate object contained in it), we still couldn’t ascribe consciousness to the inanimate parts of the universe. That is my main point here.


Another idea worth entertaining—but it is basically different from panpsychism:

Following the above-mentioned analysis, panpsychism can make sense only if what it calls “elements of consciousness” is something that is not in itself conscious, in any sense of the term.

The only idea consistent with its intended outcome can be something like a pre-consciousness, i.e., some feature or attribute or condition which, when combined with life, can give rise to a consciousness.

But note that such a pre-condition cannot mean having an actual capacity for being aware; it cannot represent the ability to have that individualized and internal grasp of reality which is present when actual living beings are actually conscious of something. That is the point to understand. The elements that panpsychism would like to have validated cannot be taken to be conscious in the way it asserts they are. The elementary attributes cannot be conscious in the same sense in which we directly grasp our own consciousness, and also use it in our usual perceptions and mental functioning.

Even if you accept the more consistent idea (viz., a pre-conscious condition or a soul which may be associated with the non-living beings too), panpsychism would still have on its hands another problem to solve: if consciousness (or even just the pre-consciousness) is distributed throughout the universe, then for what reason does it get “concentrated” to such glaringly high degrees only in the living beings? For what metaphysical function? To allow for which teleological ends? And, following what kind of a process in particular? And then, what is the teleological or metaphysical function of the elements of consciousness?… From what I gather, they don’t seem to have very good ideas regarding questions and issues like these. In fact, I very much doubt if they at all have _any_ ideas in these respects.


Dr. Sabine Hossenfelder [^] notably does touch upon the animate vs. the inanimate distinction. Congratulations to her!

However, she doesn’t pursue it as much as she could have. Her main position—viz., that electrons don’t think—is reasonable, but as I will show below, this position is inevitable only when you stay within the scope of that abstraction which is the physical reality. Her argument does not become invalid, but it does become superfluous, when it comes to the entirety of existence as such (i.e., the whole universe, including all the living as well as the non-living beings). To better put her position in context (as also those of others), let us perform a simple thought experiment.


The thought experiment to show why panpsychism is basically a false idea:

Consider a cat to be kept in a closed wooden box. (Don’t worry; the sides of the box all carry holes, and so, the cat has no problem breathing in a normal way.) Administer some general anaesthesia to the cat, thereby letting her enter into a kind of deep sleep in which she is physically unresponsive—in particular, unresponsive to external physical stimuli like a simple motion of the box. Then place the cat in the wooden box, and tie her body to a fixed position using some comfortable harnesses.

If you now apply a gentle external force to the box from the outside, the cat-plus-box system can be easily described (or simulated) using physics; some simple dynamical evolution equations apply in this case. The reason is, even though the cat is a living being, the anaesthesia leads it to temporarily lose consciousness, so that nothing other than its purely physical attributes now enter the system description.

Now repeat the same experiment when the cat is awake. As the box begins to move, the cat is sure to move its limbs and tail in response, or arch its body, etc. The *physical* attributes of her body enter the system description as before. However, these physical attributes themselves are now under the influence of (or are a function of) an additional force—one which is introduced into the system description because of the actions of the consciousness of the cat. For instance, any changes to the shape of her body are now governed not just by the externally applied forces, but also by the forces generated by the cat herself, following the actions of her consciousness. (The idea of such an additional physical force is not originally mine; I got it from Dr. Harry Binswanger.) Thus, there are certain continuing physical conditions which depend on consciousness—on its actions.

Can we rely on the principles or equations of physical evolution in the second case, too? Are our physical laws valid for describing the second case, too?

The answer is yes. We can rely on the physics principles so long as we are able to bring the physical actions produced by the consciousness of the cat into our system description. We do so via that extra set of continuing conditions. Let’s give this extra force the name: “life-physical force.”
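Schematically (and only schematically), the system description in the second case looks something like:

m * (d^2 x / dt^2) = F_applied( t ) + F_life( t )

where F_life stands for that extra, continuing set of forces generated by the cat following the actions of her consciousness. The equation remains a perfectly ordinary equation of physics; only the specification of the forces entering it has been enlarged.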

Next, suppose the entire motion of this box+cat system occurs on a wooden table. The table (just as the wooden box) is not alive. Therefore, no special life-physical force comes into the picture while calculating the table’s actions. The table acts exactly the same way whether there is only a box, or a box with a non-responsive cat, or a box with a much meowing cat. It simply supplies reaction forces; it does not generate any active action forces.

Clearly, we can explain the actions of the table in purely physical terms. In fact doing so is relatively simpler, because we don’t have to abstract away its physical attributes the way we have to, when the object is a living and conscious being. Clearly, without any loss of generality, we can do away with panpsychism (in any of its versions) when it comes to describing the actions of the table.

Since panpsychism is a redundancy in describing the action of the table, obviously, it cannot apply to the universe as a whole. So, its basic idea is false.

Overall, my position is that panpsychism cannot be taken too seriously “as is”, because it does not discuss the intermediate aspect of life (or the distinction of living vs. non-living beings). It takes what is an attribute of only a part of the existence (the consciousnesses of all living beings), and then directly proceeds to smear it on to the entirety of existence as such. In terms of our thought experiment, it takes the consciousness of the cat and smears it onto not just the wooden box, but also onto the wooden table. But as can be seen with the thought-experiment, this is a big leap of mis-attribution. Yet a panpsychist must perform it, because an entire category of considerations is lacking in it—viz., that related to life.

What possibly would a panpsychist have to do to save his thesis? Let’s see.

Since consciousness metaphysically is only an attribute of a bigger class of entities (viz., that of living beings), the only way to rescue panpsychism would be to assert that the entire universe is always alive. This is the only way to have every part of the universe conscious.

But there are big troubles with such a “solution” too.

This formulation does away with the fact of death. If all beings are always alive, such a universe ceases to contain the fact of death. Thus, the new formulation would smear out the distinction between life and death, because it would have clubbed together both (i) the actions of life or of consciousness, and (ii) the actions of the inanimate matter, into a single, incoherent package—one that has no definition, no identity. That is the basic theoretical flaw of attempting the only way in which panpsychism could logically be saved.

Now, of course, since we have given a lifeline (pun intended) to the panpsychist, he could grab it and run with it, with some further verbal gymnastics. He could possibly re-define life itself (i.e., living-ness) as a term that is not to be taken in the usual sense, but only in some basic, “elementary,” or “flavour”-some way. Possible… What would be wrong with that?

… The wrong thing is this: there are now too many flavors blurring out too many fundamental distinctions, and too few cogent definitions for all these new “concepts” of what it means to be a mere “flavour.”… Realize that the panpsychist would not be able to directly point to a single instance of, say, the table (or your T-shirt) as having some element of the same kind of life which actually is present in the actual living beings.

If an alleged consciousness (or its elementary flavor or residue) cannot perform even a single action of distinguishing something consciously, but only follows the laws of physics in its actions, then what it possesses is not consciousness. Further, if an allegedly elementary form of life can have unconditional existence and never faces death, and leads to no actions other than those which follow from the laws of physics alone, then what it possesses is not life—not even in the elementary sense of the term.

In short, panpsychism is an untenable thesis.

Finally, let me reiterate that when I said that a pre-condition (or pre-consciousness, or “soul”) can remain associated with the inanimate matter too, that idea belongs to an entirely different class. It is not what panpsychists put forth.


Comments on what other bloggers have said, and a couple of relevant asides:

For the reasons discussed above, Motl[^]’s “proof” regarding panpsychism cannot be accepted as being valid—unless he, Koch, Chalmers, or others clarify what exactly they mean by terms such as “elementary” consciousness. The same goes for the elementary bits of “life”: can there be a \Phi of life too, and if yes, how does \Phi = 0 differ from the ordinary loss of life (i.e., death) and the attendant loss of the \Phi of consciousness?

As to Hossenfelder‘s post, if a given electron does not belong to the body of a conscious (living) being, then there exist no further complications in its physical evolution; the initial and boundary conditions specified in the purely physical terms are enough to describe its actions, its dynamical evolution, to the extent that such an evolution can at all be described using physics.

However, if an electron belongs to a conscious (living) being, then the entire consideration of whether the electron by itself is conscious or not, whether it by itself thinks or not, becomes completely superfluous. The evolution of its motion now occurs under necessarily different conditions; you now have to bring in the physical forces arising due to the action of life, of consciousness, via those additional continuing conditions. Given these additional forces, the system evolution once again follows the laws of physics. The reason for that, in turn, is this: whether an elementary particle like the electron itself is conscious or not, a big entity (like a man) surely is conscious, and the extra physical effects generated by this consciousness do have to be taken into account.

An aside: Philosophy of mind is not a handmaiden to physics or its philosophy:

While on this topic, realize that you don’t have to ascribe consciousness to the electrons of a conscious (living) being. For all you know, there could perhaps be an entirely new kind of a field (or a particle) which completely explains the phenomenon of consciousness, and so, electrons (or other particles of the standard model) can continue to remain completely inanimate at all times. We don’t know if such a field exists or not.

However, my main point here is that we don’t have to settle this question ahead of observation; we don’t have to pre-empt this possibility by arbitrarily choosing to hinge the entire debate only on the particles of the standard model of physics, and slapping consciousness onto them.

Realize that the abstraction of consciousness (and all matters pertaining to it or preceding it, like the soul) is fundamentally “orthogonal” to the abstractions of physics, of physical reality. (Here, see my last post.) Don’t commit the error of taking a model (even the most comprehensive model) of physics and implicitly asking philosophy of mind to restrict its scope to this model (which itself may get revised later on!). Physics might not be a handmaiden to philosophy, but neither is philosophy a handmaiden to physics.

Finally, coming to Schlafly‘s post: he too touches upon Hossenfelder’s post, but he covers it from the more advanced viewpoints of free will, the mind-body connection, Galen’s argument, etc. [^]. I won’t discuss his post or positions in detail here because these considerations indeed are much more complicated and advanced.

Another aside: How Galen’s argument involves a superfluous consideration:

However, one point that can be noted here is that Galen fails to make the distinction of whether the atom he considers exists as a part of a conscious (living) being’s body, or whether it is a part of some inanimate object. In the former case, whether the electron itself is conscious or not (and whether there is an extra particle or field of consciousness or not, and whether there is yet another field or particle to explain the phenomenon of life or not), a description of the physical evolution of the system would still have to include the aforementioned life-physical force. Thus, the issue of whether the electron is conscious or not is a superfluous consideration. In other words, Galen’s argument involves a non-essential consideration, and therefore, it is not potent enough to settle the related issues.


Homework for you:

  • If panpsychism were to be true, your credit card, spectacles, or T-shirt would be conscious in some “elementary” sense, and so, they would have to be able to hold some “elementary” items of cognition. The question is: where and through what means do you suppose they might be keeping them? That is to say, what are the physical (or physico-electro-chemical-etc.) correlates for their content of consciousness? For instance, can a tape-recorder be taken to be conscious? Can the recording on the tape be taken as the storage of its “knowledge”? If you answer “yes,” then extend the question to the tape of the tape-recorder. Can it be said to be conscious?
  • Can there be a form of consciousness which does not carry a sense of self, even in implicit terms? As it actually happens, i.e., in reality, a conscious being doesn’t have to be able to isolate and consciously hold that it has a self; it only has to act with a sense of its own life, its own consciousness. The question asks whether, hypothetically, we can do away with that implicit sense of its own life and consciousness itself, or not.
  • Can there be a form of consciousness which comes without any mind-body integrating mechanisms such as some kinesthetic senses of feedback, including some emotions (perhaps even emotions as simple as the pleasure-pain mechanism)? Should there be medical specializations for addressing the mental health issues of tables? of electric switches? of computers?
  • Could, by any stretch of imagination, the elementary consciousness (as proposed by panpsychists) be volitional in nature?
  • Should there be a law to protect the rights of your credit card? of your spectacles? of your T-shirt? of a tape-recorder? of your laptop? of an artificial neural network running on your laptop?
  • To those who are knowledgeable about ancient Indian wisdom regarding the spiritual matters, and wish to trace panpsychism to it: If a “yogi” could do “tapascharyaa” even while existing only as an “aatmaa” i.e. even when he is not actually alive, then why should he at all have to take a birth? Why do they say that even “deva”s also have to take a human birth in order to break the bonds of “karma” and thereby attain spiritual purity?

More than three thousand words (!!) but sometimes it is necessary. In any case, I just wanted to finish off this topic so that I could return full-time to Data Science. (I will, however, try to avoid this big a post the next time; cf. my NYRs—2019 edition [^].)


A song I like:
(Marathi) “santha vaahate krushNaa maai”
Lyrics: Ga. Di. Madgulkar
Music: Datta Davajekar
Singer: Sudhir Phadke

 

A bit on Panpsychism—part 1: what its basis possibly could be

Panpsychism is an interesting theory from the philosophy of mind [^]. The topic has a long history, and it has recently been put forth in a very engaging form by an Australian-American professor, Dr. David Chalmers [^]. I gather that there also have been others like Prof. Giulio Tononi [^] and Dr. Christof Koch [^]. However, I have not yet read them or watched their videos. So, my discussion of panpsychism is going to be limited to what I understand about this theory after listening only to Prof. Chalmers. Prof. Chalmers discusses panpsychism mainly in the context of “the hard problem of consciousness.”

I had listened to Prof. Chalmers’ TedX talks last year (in 2018), and had also browsed through some of his writings. However, I didn’t think of writing a post about it then. The reason I am now writing this post is that several physicists have recently come to discuss it. See: “Electrons don’t think” by Dr. Sabine Hossenfelder [^]; “Panpsychism is needed to quantify consciousness” by Dr. Lubos(h) Motl [^]; and “The mind-body problems” by Dr. Roger Schlafly [^].

In these couple of posts (this one and the next), I am going to note a few points about panpsychism—what I think of it, based on just some surface reading (and watching videos) on the topic by Prof. David Chalmers. My write-up here is exploratory, and for that reason, a bit meandering.


Panpsychism says (going by the definition of the term thrown up by a Google search on the word) that “everything material, however small, has an element of individual consciousness.” For this post, we will assume that this definition correctly characterizes panpsychism. Also see the Google Ngram, at this link: [^]


The thesis of panpsychism seems to have the following two ideas at its base:

(i) What we perceive can cut across the entirety of the existence.

There are no sub-categories of beings (or parts of existence) that can in principle (i.e. directly or indirectly) remain permanently inaccessible to us, i.e., to our means and methods of cognition. For instance, consider the fact that a technique like SEM (scanning electron microscope) can bring certain spatial features of bacteria or of nano-scale structures to a high-fidelity representation that is within the range of our direct perception. Something similar, for the idiot box in your room—it brings a remote scene “to life” in your room.

Notice that this philosophical position means: a denial of a “second” (or “third” etc.) world that is permanently inaccessible to the rest of us, but one that is, somehow, definitely accessible to philosophers of mysticism such as Plato or Kant.

(ii) The idea that what we perceive includes both the realms: the physical realm, and the realm of the mind or consciousness.

Obviously, by the “realm” of consciousness, we don’t mean a separate world. We here take the word “realm” in the sense of just a collective noun for such things as: the contents of consciousnesses, their actions, the products of their processes, etc., as beings having consciousness are observed to exist and be conscious of this world (or take conscious actions in it).

By the idea of the two abstract realms—physical and consciousness-related—we mean a categorically improved version of the Cartesian division—which is to say that our realms have no connection whatsoever to the actual Cartesian division.

[I don’t know if all advocates of panpsychism accept the above two ideas or not. However, when I began wondering what could possibly be the theoretical bases of this idea (of panpsychism), these two seemed to be the right kind of bases.]


Given the above two ideas, the logic of panpsychism basically seems to go something like this:

Since the world we can directly or indirectly perceive is all there is to existence, and since our perception includes both the physical and the consciousness-related aspects, therefore, we should take a direct jump to the conclusion that any part of the existence must carry both kinds of attributes—physical, and the consciousness-pertaining.


If you ask me, there is a problem with this position (of panpsychism). I will cover it in a separate post later this week. I would like to see whether, knowing the fact that I find the logic problematic, you would want to give it a try as to what the reasoning could be like, so that we could cross-check our notes. … Happy thinking!

Bye for now… [The songs section will come back in the next part, to be posted soon enough.]


Originally published on 2019.01.06 14:59 IST. Slightly revised (without introducing any new point) on 2019.01.07 10:15 IST.

Data science code snippets—2: Using world’s simplest ANN to visualize that deep learning is hard

Good morning! [LOL!]


In the context of machine learning, “learning” only means: changes that get effected to the weights and biases of a network. Nothing more, nothing less.

I wanted to understand exactly how the learning propagates through the layers of a network. So, I wrote the following piece of code. Actually, the development of the code went through two distinct stages.


First stage: A two-layer network (without any hidden layer); only one neuron per layer:

In the first stage, I wrote what demonstrably is the world’s simplest possible artificial neural network. This network comes with only two layers: the input layer, and the output layer. Thus, there is no hidden layer at all.

The advantage with such a network (“linear” or “1D”, and without a hidden layer) is that the total error at the output layer is a function of just one weight and one bias. Thus the total error (or the cost function) is a function of only two independent variables. Therefore, the cost-function surface can (at all) be directly visualized or plotted. For ease of coding, I plotted this function as a contour plot, but a 3D plot should also be possible.
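For the record (and using the sigmoid activation and the squared-error cost assumed in the code further below; see the comment above CostDerivative), for a single training case with input x and target t the cost is simply:

C( w, b ) = 0.5 * ( t − Sigmoid( w * x + b ) )^2

so that w and b are the only unknowns, which is exactly what makes the direct plot possible.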

Here are a couple of pictures it produces. The input varies randomly in the interval [0.0, 1.0) (i.e., excluding 1.0), and the target is kept at 0.15.

The following plot shows an intermediate stage during training, a stage when the gradient descent algorithm is still very much in progress.

Simplest network with just the input and output layers, with only one neuron per layer: gradient descent in progress.

The following plot shows when the gradient descent has taken the configuration very close to the local (and global) minimum:

Simplest network with just the input and output layers, with only one neuron per layer: gradient descent near the local (and global) minimum.


Second stage: An n-layer network (with many hidden layers); only one neuron per layer

In the second stage, I wanted to show that even if you add one or more hidden layers, the gradient descent algorithm works in such a way that most of the learning occurs only near the output layer.

So, I generalized the code a bit to have an arbitrary number of hidden layers. However, the network continues to have only one neuron per layer (i.e., it maintains the topology of the bogies of a train). I then added a visualization showing the percent changes in the biases and weights at each layer, as learning progresses.

Here is a representative picture it produces when the total number of layers is 5 (i.e. when there are 3 hidden layers). It was made with both the biases and the weights all being set to the value of 2.0:

It is clearly seen that almost all of learning is limited only to the output layer; the hidden layers have learnt almost nothing!
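Why this happens is not mysterious. Schematically, and in the notation of the code given further below, the back-propagated error for this one-neuron-per-layer chain obeys:

delta_output = ( a_output − target ) * SigmoidDerivative( z_output )
delta_l = delta_(l+1) * w_(l+1) * SigmoidDerivative( z_l ) for each earlier layer,

with the gradients for b_l and w_l being delta_l and delta_l * a_(l−1) respectively. Since the sigmoid derivative never exceeds 0.25 (and is far smaller when the activations saturate, as they do with all biases and weights initialized to 2.0), every step backwards multiplies the error by a factor w_(l+1) * SigmoidDerivative( z_l ) that tends to shrink it. Hence the layers near the input learn far less than the output layer does.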

Now, by way of a contrast, here is what happens when you have all initial biases of 0.0, and all initial weights of 2.0:

Here, the last hidden layer has begun learning enough that it shows some visible change during training, even though the output layer learns much more than does the last hidden layer. Almost all learning is via changes to weights, not biases.

Next, here is the reverse situation: when you have all initial biases of 2.0, but all initial weights of 0.0:

The bias of the hidden layer does undergo a slight change, but in an opposite (positive) direction. Compared to the case just above, the learning now is relatively more concentrated in the output layer.

Finally, here is what happens when you initialize both biases and weights to 0.0.

The network does learn (the difference between the predicted and the target values does go on diminishing as the training progresses). However, the percentage change is too small to be visually registered (when plotted to the same scale as was used earlier).


The code:

Here is the code which produced all the above plots (but you have to suitably change the hard-coded parameters to get to each of the above cases):

'''
SimplestANNInTheWorld.py
Written by and Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Implements the case-study of the simplest possible 1D ANN.
-- Each layer has only one neuron. Naturally, there is only one target!
-- It may not even have hidden layers. 
-- However, it can have an arbitrary number of hidden layers. This 
feature makes it a good test-bed to see why and how the neurons 
in the hidden layers don't learn much during deep learning, during 
a ``straight-forward'' application of the gradient descent algorithm. 
-- Please do drop me a comment or an email 
if you find this code useful in any way, 
say, in a corporate training setup or 
in academia. Thanks in advance!
-- History:
* 30 December 2018 09:27:57  IST: 
Project begun
* 30 December 2018 11:57:44  IST: 
First version that works.
* 01 January 2019 12:11:11  IST: 
Added visualizations for activations and gradient descent for the 
last layer, for no. of layers = 2 (i.e., no hidden layers). 
* 01 January 2019 18:54:36  IST:
Added visualizations for percent changes in biases and 
weights, for no. of layers >=3 (i.e. at least one hidden layer).
* 02 January 2019 08:40:17  IST: 
The version as initially posted on my blog.
'''
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Functions to generate the input and test data

def GenerateDataRandom( nTrainingCases ):
    # Note: randn() returns samples from the normal distribution, 
    # but rand() returns samples from the uniform distribution: [0,1). 
    adInput = np.random.rand( nTrainingCases )
    return adInput

def GenerateDataSequential( nTrainingCases ):
    adInput = np.linspace( 0.0, 1.0, nTrainingCases )
    return adInput

def GenerateDataConstant( nTrainingCases, dVal ):
    adInput = np.full( nTrainingCases, dVal )
    return adInput

################################################################################
# Functions to generate biases and weights

def GenerateBiasesWeightsRandom( nLayers ):
    adAllBs = np.random.randn( nLayers-1 )
    adAllWs = np.random.randn( nLayers-1 )
    return adAllBs, adAllWs

def GenerateBiasesWeightsConstant( nLayers, dB, dW ):
    adAllBs = np.ndarray( nLayers-1 )
    adAllBs.fill( dB )
    adAllWs = np.ndarray( nLayers-1 )
    adAllWs.fill( dW )
    return adAllBs, adAllWs

################################################################################
# Other utility functions

def Sigmoid( dZ ):
    return 1.0 / ( 1.0 + np.exp( - dZ ) )

def SigmoidDerivative( dZ ):
    dA = Sigmoid( dZ )
    dADer = dA * ( 1.0 - dA )
    return dADer

# Server function. Called with activation at the output layer. 
# In this script, the target value is always one and the 
# same (a single constant; dTarget = 0.15 in the main script). 
# Assumes that the form of the cost function is: 
#       C_x = 0.5 * ( dT - dA )^2 
# where, note carefully, that target comes first. 
# Hence the partial derivative is:
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# where note carefully that the activation comes first.
def CostDerivative( dA, dTarget ):
    return ( dA - dTarget ) 

def Transpose( dA ):
    # For the scalars used in this 1D network a transpose is a no-op,
    # but return the transposed value instead of silently discarding it.
    return np.transpose( dA )

################################################################################
# Feed-Forward

def FeedForward( dA ):
    ## print( "\tFeed-forward" )
    l_dAllZs = []
    # Note, this makes l_dAllAs have one extra data member
    # as compared to l_dAllZs, with the first member being the
    # supplied activation of the input layer 
    l_dAllAs = [ dA ]
    nL = 1
    for w, b in zip( adAllWs, adAllBs ):
        dZ = w * dA + b
        l_dAllZs.append( dZ )
        # Notice, dA has changed because it now refers
        # to the activation of the current layer (nL) 
        dA = Sigmoid( dZ )  
        l_dAllAs.append( dA )
        ## print( "\tLayer: %d, Z: %lf, A: %lf" % (nL, dZ, dA) )
        nL = nL + 1
    return ( l_dAllZs, l_dAllAs )

################################################################################
# Back-Propagation

def BackPropagation( l_dAllZs, l_dAllAs ):
    ## print( "\tBack-Propagation" )
    # Step 1: For the Output Layer
    dZOP = l_dAllZs[ -1 ]
    dAOP = l_dAllAs[ -1 ]
    dZDash = SigmoidDerivative( dZOP )
    dDelta = CostDerivative( dAOP, dTarget ) * dZDash
    dGradB = dDelta
    adAllGradBs[ -1 ] = dGradB

    # Since the last hidden layer has only one neuron, no need to take transpose.
    dAPrevTranspose = Transpose( l_dAllAs[ -2 ] )
    dGradW = np.dot( dDelta, dAPrevTranspose )
    adAllGradWs[ -1 ] = dGradW
    ## print( "\t* Layer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-1, dGradB, dGradW) )

    # Step 2: For all the hidden layers
    for nL in range( 2, nLayers ):
        dZCur = l_dAllZs[ -nL ]
        dZCurDash = SigmoidDerivative( dZCur )
        dWNext = adAllWs[ -nL+1 ]
        dWNextTranspose = Transpose( dWNext )
        dDot = np.dot( dWNextTranspose, dDelta ) 
        dDelta = dDot * dZCurDash
        dGradB = dDelta
        adAllGradBs[ -nL ] = dGradB

        dAPrev = l_dAllAs[ -nL-1 ]
        dAPrevTrans = Transpose( dAPrev )
        dGradWCur = np.dot( dDelta, dAPrevTrans )
        adAllGradWs[ -nL ] = dGradWCur

        ## print( "\tLayer: %d\n\t\tGradB: %lf, GradW: %lf" % (nLayers-nL, dGradB, dGradW) )

    return ( adAllGradBs, adAllGradWs )

def PlotLayerwiseActivations( c, l_dAllAs, dTarget ):
    plt.subplot( 1, 2, 1 ).clear()
    dPredicted = l_dAllAs[ -1 ]
    sDesc = "Activations at Layers. Case: %3d\nPredicted: %lf, Target: %lf" % (c, dPredicted, dTarget) 
    plt.xlabel( "Layers" )
    plt.ylabel( "Activations (Input and Output)" )
    plt.title( sDesc )
    
    nLayers = len( l_dAllAs )
    dES = 0.2	# Extra space, in inches
    plt.axis( [-dES, float(nLayers) -1.0 + dES, -dES, 1.0+dES] )
    
    # Plot a vertical line at the input layer, just to show variations
    plt.plot( (0,0), (0,1), "grey" )
    
    # Plot the dots for the input and hidden layers
    for i in range( nLayers-1 ):
        plt.plot( i, l_dAllAs[ i ], 'go' )
    # Plot the dots for the output layer
    plt.plot( nLayers-1, dPredicted, 'bo' )
    plt.plot( nLayers-1, dTarget, 'ro' )
    
def PlotGradDescent( c, dOrigB, dOrigW, dB, dW ):
    plt.subplot( 1, 2, 2 ).clear()
    
    d = 5.0
    ContourSurface( d )
    plt.axis( [-d, d, -d, d] ) 
    plt.plot( dOrigB, dOrigW, 'bo' )
    plt.plot( dB, dW, 'ro' )
    plt.grid()
    plt.xlabel( "Biases" )
    plt.ylabel( "Weights" )
    sDesc = "Gradient Descent for the Output Layer.\n" \
    "Case: %3d\nWeight: %lf, Bias: %lf" % (c, dW, dB) 
    plt.title( sDesc )
    
    
def ContourSurface( d ):
    nDivs = 10
    dDelta = d / nDivs
    w = np.arange( -d, d, dDelta )
    b = np.arange( -d, d, dDelta )
    W, B = np.meshgrid( w, b ) 
    A = Sigmoid( W + B )
    plt.imshow( A, interpolation='bilinear', origin='lower',
                cmap=plt.cm.Greys, # cmap=plt.cm.RdYlBu_r, 
                extent=(-d, d, -d, d), alpha=0.8 )
    CS = plt.contour( B, W, A )
    plt.clabel( CS, inline=1, fontsize=7 )

def PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget ):
    plt.clf()
    
    nComputeLayers = len( adOrigBs )
    plt.axis( [-0.2, nComputeLayers+0.7, -320.0, 320.0] )
    
    adBPct = GetPercentDiff( nComputeLayers, adAllBs, adOrigBs )
    adWPct = GetPercentDiff( nComputeLayers, adAllWs, adOrigWs )
    print( "Case: %3d" \
    "\nPercent Changes in Biases:\n%s" \
    "\nPercent Changes in Weights:\n%s\n" \
     % (c, adBPct, adWPct)  )
    adx = np.linspace( 0.0, nComputeLayers-1, nComputeLayers )
    plt.plot( adx + 1.0, adWPct, 'ro' )
    plt.plot( adx + 1.15, adBPct, 'bo' )
    plt.grid()
    plt.xlabel( "Layer Number" )
    plt.ylabel( "Percent Change in Weight (Red) and Bias (Blue)" )
    sTitle = "How most learning occurs only at an extreme layer\n" \
    "Percent Changes to Biases and Weights at Each Layer.\n" \
    "Training case: %3d, Target: %lf, Predicted: %lf" % (c, dTarget, dPredicted) 
    plt.title( sTitle )

def GetPercentDiff( n, adNow, adOrig ):
    adDiff = adNow - adOrig
    print( adDiff )
    adPct = np.zeros( n )
    dSmall = 1.0e-10
    # Guard against division by (near-)zero values in adOrig
    if all( abs( adDiff ) > dSmall ) and all( abs( adOrig ) > dSmall ):
        adPct = adDiff / adOrig * 100.0
    return adPct


################################################################################
# The Main Script
################################################################################

dEta = 1.0 # The learning rate
nTrainingCases = 100
nTestCases = nTrainingCases // 5
adInput = GenerateDataRandom( nTrainingCases ) #, 0.0 )
adTest = GenerateDataRandom( nTestCases )

np.random.shuffle( adInput )
## print( "Data:\n %s" % (adInput) )

# Must be at least 2. Tested up to 10 layers.
nLayers = 2
# Just a single target! Keep it in the interval (0.0, 1.0), 
# i.e., excluding both the end-points of 0.0 and 1.0.

dTarget = 0.15

# The input layer has no biases or weights. Even the output layer 
# here has only one target, and hence, only one neuron.
# Hence, the weights matrix for all layers now becomes just a 
# vector.
# For visualization with a 2 layer-network, keep biases and weights 
# between [-4.0, 4.0]

# adAllBs, adAllWs = GenerateBiasesWeightsRandom( nLayers )
adAllBs, adAllWs = GenerateBiasesWeightsConstant( nLayers, 2.0, 2.0 )
dOrigB = adAllBs[-1]
dOrigW = adAllWs[-1]
adOrigBs = adAllBs.copy()
adOrigWs = adAllWs.copy()

## print( "Initial Biases\n", adAllBs )
## print( "Initial Weights\n", adAllWs )

plt.figure( figsize=(10,5) )
    
# Do the training...
# For each input-target pair,
for c in range( nTrainingCases ):
    dInput = adInput[ c ]
    ## print( "Case: %d. Input: %lf" % (c, dInput) )
    
    adAllGradBs = [ np.zeros( b.shape ) for b in adAllBs ]
    adAllGradWs = [ np.zeros( w.shape ) for w in adAllWs ]
    
    # Do the feed-forward, initialized to dA = dInput
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    
    # Do the back-propagation
    adAllGradBs, adAllGradWs = BackPropagation( l_dAllZs, l_dAllAs )

    ## print( "Updating the network biases and weights" )
    adAllBs = [ dB - dEta * dDeltaB 
                for dB, dDeltaB in zip( adAllBs, adAllGradBs ) ]
    adAllWs = [ dW - dEta * dDeltaW 
                for dW, dDeltaW in zip( adAllWs, adAllGradWs ) ]
    
    ## print( "The updated network biases:\n", adAllBs )
    ## print( "The updated network weights:\n", adAllWs )

    if 2 == nLayers:
        PlotLayerwiseActivations( c, l_dAllAs, dTarget )
        dW = adAllWs[ -1 ]
        dB = adAllBs[ -1 ]
        PlotGradDescent( c, dOrigB, dOrigW, dB, dW )
    else:
        # Plot in case of many layers: Original and Current Weights, Biases for all layers
        # and Activations for all layers 
        dPredicted = l_dAllAs[ -1 ]
        PlotLayerWiseBiasesWeights( c, adOrigBs, adAllBs, adOrigWs, adAllWs, dPredicted, dTarget )
    plt.pause( 0.1 )

plt.show()

# Do the testing
print( "\nTesting..." )
for c in range( nTestCases ):
    
    dInput = adTest[ c ]
    print( "\tTest Case: %d, Value: %lf" % (c, dInput) )
    
    l_dAllZs, l_dAllAs = FeedForward( dInput )
    dPredicted = l_dAllAs[ -1 ]
    dDiff = dTarget - dPredicted
    dCost = 0.5 * dDiff * dDiff
    print( "\tInput: %lf, Predicted: %lf, Target: %lf, Difference: %lf, Cost: %lf\n" % (dInput, dPredicted, dTarget, dDiff, dCost) )

print( "Done!" )

 


Things you can try:

  • Change one or more of the following parameters, and see what happens:
    • Target value
    • Values of initial weights and biases
    • Number of layers
    • The learning rate, dEta
  • Change the cost function, and/or the activation function; e.g., try a linear activation in place of the Sigmoid. Change the code accordingly. (A minimal sketch of the activation swap appears right after this list.)
  • Also, try to conceptually see what would happen when the number of neurons per layer is 2 or more…
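
For the activation-function swap mentioned in the second item, here is a minimal sketch of what it could look like. This is my own addition, not part of the script above: a linear (identity) activation and its derivative, written against the same interface as Sigmoid() / SigmoidDerivative(). You could either rename these two to Sigmoid / SigmoidDerivative, or update the call sites in FeedForward() and BackPropagation().

import numpy as np

# A minimal sketch (my addition): a linear (identity) activation in place of
# the Sigmoid, keeping the same call interface as the script above.
def Linear( dZ ):
    # Identity activation: A = Z
    return dZ

def LinearDerivative( dZ ):
    # dA/dZ = 1 everywhere; keep the same shape/type as the input
    return np.ones_like( dZ )

Note that with a purely linear activation, the multi-layer network collapses, mathematically, into a single linear map; and since the activation is now unbounded, the learning rate dEta may have to be reduced to keep the updates from overshooting.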

Have fun!


A song I like:

(Marathi) “pahaaTe pahaaTe malaa jaag aalee”
Music and Singer: C. Ramchandra
Lyrics: Suresh Bhat

 

My new year’s resolutions—2019 edition

Here are my resolutions for the new year:


1. Get a suitable job in Data Science in Pune.

Revise the resume and upload / send out by January-end.


2. Wrap up my research on the non-relativistic QM:

Get a Beamer presentation (containing all the main points of the paper) out by Date-1.

Get the first version of the QM paper out, by Date-2.

Submit the paper to a suitable journal which accepts papers on the foundations, by Date-3.

The optimistic-realistic-pessimistic estimates for the dates are:
Date-1: 28 Feb — 31 March — 31 May
Date-2: 31 March — 31 May — 31 July
Date-3: 30 April — 30 June — 31 August

The reason for the somewhat larger variance is that I will also be working in Data Science, and will probably be just beginning a job in it. Much depends on how circumstances work out in this regard.

It’s very likely that QM will cease to be of much interest to me after that, though, of course, I will keep myself available for discussions concerning this paper. Another exception is noted near the end.


3. Self-metered ‘net access:

No more than just one hour of general surfing per day, preferably 45 minutes. The time spent on blogging, browsing news sites, reading personal blogs, emails, etc. is included.

Time taken by the OS and software updates, and by large downloads like software libraries, large data-sets, etc. is not included. Any time possibly spent on programming in the cloud, on browsing tutorials, help on python libraries etc., also is not included.


4. Daily exercises and meditation:

To quantify: “Daily” here means on at least 300 days a year.

Do some mild exercises at home, daily. (Really mild form of “exercise”s. In the main, just a few stretching exercises, and a few “surya-namaskaar”s. No need to exceed 12–15 “namaskaar”s a day eventually; begin with just 3 to 5.)

Offer a small prayer at home. Just about 10–15 minutes, but try to do it daily. (No particular time-slot in the day.)

Meditate more regularly, say for 15–30 minutes, at least 4 times a week. At least 10 minutes on the remaining days, just to keep a continuity.

Try: Taking a morning walk to a nearby garden at least 3 times a week, preferably daily (rainy days excluded). (Currently doable, but it depends on where I get my next job. If I have to spend some 3–4 hours a day in just commuting (the way I did during 2015–16), then no guilt for dropping this resolution.)

Come to think of it, I have done all of this for extended periods (of several years). It was just that, after moving to Mumbai (in 2013), a break occurred. I am just going to pick up these good habits once again. All in all, it should be easy to keep this resolution. Let’s see how it turns out.


5. Eat more salads:

Once in a job, try to have mostly just salads for lunch (thus ensuring 5 meals a week predominantly of salads). Even otherwise, try to have salads for lunch on about 15 days out of a month.

I have tried eating salads, and have found that, once again, this resolution too should be pretty easy to follow. Indeed, this is going to be the easiest one for me to keep. The reason is: really good salad services are available in Pune these days—not just veg. salads but also the greens + nonveg type of salads.


6. Begin cultivating a pet-project in Data Science:

Settle on an area and begin working on it this year.

The topic is likely to be: Rain-fall predictions.

A good, more modest initial goal might be to build a model for predicting just the temperatures between October and May. That’s because predictions for temperatures in this period, I guess, would mostly involve only temperature and wind-speed data, which should be more easily available. (Data and predictions for pressure, humidity, and rainfall might turn out to be a more difficult story.)


Things noticeably absent from my resolutions:

1. Restrictions on food & drinks. The idea is that the above resolutions themselves should lead to a better lifestyle so that restrictions as such aren’t necessary. And, in case the above resolutions also get broken, then trying to observe restrictions on just food and drinks is going to be pretty artificial, even just a “duty”! To be avoided.

2. Some other “Good habit”s like maintaining records of expenses on a daily basis, writing a diary, etc. I just cannot maintain such things on a regular basis, so there is no point in making any resolutions about them.


Other things on the todo lists (though not resolutions):

1. After getting a job in Data Science, also try to explore a job as an Adjunct/Affiliate Professor in Pune. No more than 6 hours of commitment per week, including any time spent guiding student projects. For about 2 hours / week, even pro-bono positions can be considered, if the college is convenient for commute. Only for the computational topics of: Data Science / FEM / CFD / computational QM.

2. If possible, begin exploring relativistic QM. No time-frame is being specified for this study. It will be just an exploration. The only reason to include it here is that I believe my new approach is likely to simplify understanding the relativistic QM as well; so I would just like to explore the simplest theoretical topics (at the UG level) concerning the relativistic QM too. (So far, I have summarily ignored it, but from now on, especially in the second half of the year, and especially after my paper on non-relativistic QM is out, I would like to pursue it, just a bit.)

3. Participate in a Kaggle competition, especially in the second half of this year—purely for fun. If possible, do it right in the first half (though, because of QM and all, that might not be possible; still, if I get someone suitable to form a team with, this option would remain open).


Changes at this blog:

1. For the songs section, from now on, I may repeat some of the songs I have already run here.

It sometimes so happens that a song comes to me very naturally, and I like it too, but just because I noted it on the blog some time ago, I cannot mention it again. In the new year, I am going to break this (self-made) rule.

2. I will also try to reduce the length of blog posts, preferably to within 1000 words per entry.


A song I like:

(Western, instrumental): The title song of the movie “Chariots of Fire.”
Music: Vangelis. [Inspired from the song “City of Violets” by Stavros Logarides? See the note below.]

Note: I guess I had missed this movie (though I had watched its trailers in the movie halls many times back then in the early 1980s). Thus, the version of this song which I first listened to probably was not the original one [^], but some later rendition by someone / some orchestra, very possibly, that by Paul Mauriat. My primary memory of this song refers to this later version. Yesterday, when I checked out Paul Mauriat’s version [^], I felt that this wasn’t it. Some time in between, there also appeared a rendition by Yanni [^], and I liked it too. (I am sure that I had listened to this song before Yanni’s version came on this scene). Finally, just yesterday, I also caught, for the very first time, the London Olympics 2012 version (i.e., “Isles of Wonder” by the London Symphony Orchestra); once again, I found that it was a great rendition [^]. … It’s wonderful to see different emphases being made to the same “tune.”

Today, if I have to make a choice, I would pick up Paul Mauriat’s version [^] as the one I like best.

Incidentally, yesterday, while browsing the Wikipedia for this movie and the song, I also got to know for the first time about the plagiarism controversy involving this song [^], and so, I checked out Stavros Logarides’ song: “City of Violets” [^], especially this version [^]. The similarity is plain unmistakable. Even if Vangelis is a reputed composer, and has won not just the academy award but also the court-case (about the alleged plagiarism), if you ask me, the similarity is sufficient that I have no choice but to note Logarides’ name as well. After all, his song historically came first—whether Vangelis was inspired from it or not!


My approach, my credit:

The song controversy again highlights the reason why care must be taken by any original author, for protecting his IPR. … Another reason why I have been insisting on holding those informal seminars in the physics departments in this country, and why I got upset when all these physicists declined me.

The latest email I wrote (a couple of days ago) has been to Prof. Sunil Mukhi, HoD Physics, IISER Pune [^]; he also maintains this blog [^]. I wrote that email with a cc to Prof. Nilima Gupte [^] of IIT Madras, my alma mater. (Gupte and Mukhi were students at SUNY Stony Brook at the same time, I had gathered years ago, while reading the blog maintained by Gupte’s late husband.) As of this writing, I still await Mukhi’s reply.

The reason now to rush up at least a set of presentation slides (on my new approach to QM) has also to do with the fact that my computer was broken into, over the past few months. Best to hurry up the publication. Thus the resolution # 2 above.


Anyway, enough is enough. Any further editing will be a very minor one, and even if I effect it, there won’t be any additions to my NYRs, for sure! For the same reason, I won’t even separately note such minor updates.

Bye for now, take care, and wish you all a happy (and a prosperous) new year!

Data science code snippets—1: My first ANN (a copy of Nielsen’s)

I took the liberty of re-typing Nielsen’s[^] code [^], in the process introducing the Hungarian naming convention for variables, and also many other non-Pythonic and non-Nielsenic whimsies. Here is the end-result, for whatever it is worth. (Its sole intended purpose was to run the feed-forward and back-propagation algorithms in an IDE.) Tested just a bit on Ubuntu Release 18.04.1 LTS (Bionic Beaver) 64-bit + Python 3.7.0 64-bit in conda 4.5.11 + MS VSCode 1.30.1.
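
For quick reference, here is the naming convention as I read it off the variable names actually used in the code below (just a reading aid inferred from the code, not a formal specification):

# Prefix conventions (a reading aid, inferred from the code that follows):
#   n   : integer                         e.g., nLayers, nEpochs, nMiniBatchSize
#   d   : double (float)                  e.g., dEta, dTarget
#   s   : string                          e.g., sLine, sErrorsFileName
#   an  : numpy array of integers         e.g., anNIL
#   ad  : numpy array of doubles          e.g., adErrors
#   md  : matrix (2D array) of doubles    e.g., mdZ, mdDelta
#   amd : array/list of such matrices     e.g., amdAllGradBs, amdAllGradWs
#   l_  : Python list                     e.g., l_mdAllAs, l_mdAllZs
#   fo  : file object                     e.g., foErrors
#   m_  : class data member               e.g., self.m_nLayers, self.m_amdAllWs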

It works. (Has to. Except for the cosmetic changes, it’s basically Nielsen’s code.) It produces this graph as an output:

[Plot: Total Errors vs. Epochs]

“Enjoy!”

'''
ANN.Nielsen.SmallerAndEasiestDataVectors.py
-- Simple Artificial Neural Network.
-- This code very closely follows "network.py" given in Nielsen's book
(chapter 1). It was modified a bit by Ajit R. Jadhav. 
This code Copyright (c) Ajit R. Jadhav. All rights reserved.
-- Meant for developing understanding of ANNs through debugging in IDE 
(with ability to inspect values live at run-time that also can be 
verified manually, by taking *smaller* and *simpler* vectors for the input 
and target data-sets).
-- Includes functions for generation of small and simple 1D input vectors 
and their corresponding target vectors. The input data generated here is, 
effectively, just a randomized version of Nielsen's "vectorized_result" 
function.
-- Other features of this code:
* Writes the total error for each epoch, to a file.
* Plots the total error vs. epoch number. 
* Follows the Hungarian naming convention. However, the variable names 
may not reflect the best possible choices; this code was written in a hurry. 
-- Do whatever you like with it, but 100% on your own 
responsibility. 
-- But if you use this code for academic or corporate 
training purposes, then please do let me know by an 
email or a (confidential) comment here. 
Thanks in advance!
-- History
* Begun: 23 Dec. 2018 
* Completed first version that works: 26 Dec. 2018 
* This version: Sunday 30 December 2018 11:54:52  IST 
'''

import random
import numpy as np 
import matplotlib.pyplot as plt 

################################################################################
# Global helper functions
################################################################################

################################################################################ 
# Helper function. Uses numpy.random.randn() to create inputs and targets, 
# whether for training or for testing. Useful for very simple 1D studies only. 
# (The input data as generated here very easily predict the targets! Even just 
# a single call to the softmax function would have been sufficient!!)
def GenerateData( nNeursInInput, nLenOPLayer, nCasesPerTarget ):
    # Python lists for inputs and targets, returned at the end by this function. 
    l_mdInput = []
    l_mdTarget = []
    for t in range( nLenOPLayer ):
        for i in range( nCasesPerTarget ):
            mdInput = np.random.rand( nNeursInInput, 1 )*0.1
            mdInput[t][0] = mdInput[t][0] + np.random.random_sample()*0.1 + 0.8
            l_mdInput.append( mdInput )
            
            mdTarget = np.zeros( (nLenOPLayer,1) ) 
            mdTarget[t][0] = 1.0
            l_mdTarget.append( mdTarget )
    # pickle returns data as arrays; let's simulate the same thing here
    amdInput = np.array( l_mdInput )
    amdOutput = np.array( l_mdTarget )
    # Python 3(+?) thing. Convert zip to list either here or later.
    # Better to do it here. 
    Data = list( zip( amdInput, amdOutput ) )
    return Data
   
################################################################################ 
# Helper function. Computes the sigmoid activation function
def Sigmoid( mdZ ):
    return 1.0 / ( 1.0 + np.exp( - mdZ ) )

################################################################################ 
# Helper function. Computes the derivative of the sigmoid activation function
def SigmoidDerivative( mdZ ):
    mdA = Sigmoid( mdZ )
    mdADer = mdA * ( 1.0 - mdA )
    return mdADer

################################################################################ 
# Helper function. Called with a single activation vector with its 
# target vector. Assumes that the form of the cost function for each 
# neuron is (note carefully that target comes first): 
#       C_x = 0.5 * ( dT - dA )^2 
# so that  
#       \partial C_x / \partial dA = - ( dT - dA ) = ( dA - dT ) 
# (where note that the activation comes first).
def CostDerivative( mdA_OP, mdT ):
    return ( mdA_OP - mdT ) 


################################################################################
# Class Network begins here
################################################################################

class Network(object):

    ############################################################################
    # Constructor. 
    # Supply an array containing number of neurons for each layer
    # e.g., [5,4,3]  
    def __init__(self, anNIL):
        self.m_nLayers = len( anNIL )
        self.m_anNIL = np.array( anNIL )
        # anNIL[1:] means: An array with all entries except the first
        # anNIL[:-1] means: An array with all entries except for the last
        # We allocate (N,1)-shaped matrices rather than (N)-shaped vectors
        # because later on in back-propagation, to have a vectorized code,
        # we perform np.dot() in its tensor-product-like avatar, and there,
        # we need both the operands to be matrices. 
        self.m_amdAllBs = [ np.random.randn( nNeursCur, 1 ) 
                            for nNeursCur in anNIL[1:] ] 
        self.m_amdAllWs = [ np.random.randn( nNeursCur, nNeursPrev ) 
                            for nNeursPrev, nNeursCur in
                                zip( anNIL[:-1], anNIL[1:] ) ] 
        pass

    ############################################################################
    # For each (input, target) tuple representing a row in the 
    # mini-batch, perform the training. We compute the changes (delta's) 
    # to be effected to biases and weights through feed-forward and 
    # backpropagation, add them up until the mini-batch is over, and then
    # apply them to update the weights and biases only once at the end of 
    # this function. Unlike in Nielsen's code, here we have a separate 
    # function each for feed-forward and back-propagation
    def UpdateMiniBatch( self, dEta, nMiniBatchSize, MiniBatch ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't directly 
        # use np.zeros() because these matrices are not of uniform sizes; 
        # the no. of neurons in each layer is arbitrarily different
        amdAllGradBs = [ np.zeros( mdGradB.shape ) for mdGradB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdGradW.shape ) for mdGradW in self.m_amdAllWs ]
        l_mdErrorsMB = []

        # For each (input, target) tuple representing a row in the 
        # mini-batch, perform the training. 
        
        for x, y in MiniBatch:
            
            # Feed-Forward Pass: Feed the next input vector to the network 
            # using the existing weights and biases, and find the net-input 
            # and activations it produces at all the network layers.
            
            l_mdAllZs, l_mdAllAs = self.FeedForward( x ) 
            
            # Compute the error vector
            mdDiff = y - l_mdAllAs[ -1 ]
            mdErrors = 0.5 * mdDiff * mdDiff
            l_mdErrorsMB.append( mdErrors )
            
            # Back-Propagation Pass: Back-propagate the change in the total cost 
            # function (total error) to the local gradients to be applied to 
            # each weight (of a synaptic connection between current and previous
            # layers), and to each bias (of the current layer). 
            amdDeltaGradB, amdDeltaGradW = self.BackProp( l_mdAllAs, l_mdAllZs, y )
            
            # Add the changes to the local copies of the biases and weights. 
            # The zip function takes iterable elements as input, and returns an 
            # iterator for the tuple. Thus, zipping here makes it iterate through 
            # each layer of the network.
            amdAllGradBs = [ mdGradB + mdDeltaGradB 
                        for mdGradB, mdDeltaGradB in zip( amdAllGradBs, amdDeltaGradB ) ]

            amdAllGradWs = [ mdGradW + mdDeltaGradW 
                        for mdGradW, mdDeltaGradW in zip( amdAllGradWs, amdDeltaGradW ) ]
        # Processing for all the rows in the current mini-batch is now over. 
        # Now, update the network data-members from the local copies.
         
        # Note, we take an average of the changes over all rows.
        dEtaAvgOverMB = dEta / nMiniBatchSize

        # mdB, mdW mean: the old biases / weights matrices for a *single* layer
        # mdGradB, mdGradW mean: \sum_{i}^{all rows of mini-batch} Grad_{i-th row}
        self.m_amdAllBs = [ mdB - dEtaAvgOverMB * mdGradB 
            for mdB, mdGradB in zip( self.m_amdAllBs, amdAllGradBs ) ]

        self.m_amdAllWs = [ mdW - dEtaAvgOverMB * mdGradW 
            for mdW, mdGradW in zip( self.m_amdAllWs, amdAllGradWs ) ]

        # Return the average error vector for this mini-batch
        return l_mdErrorsMB

    ############################################################################
    # Called for a single input vector (i.e., "x" in Nielsen's code)
    def FeedForward( self, mdA ):

        # Python list of all activations, layer-by-layer 
        l_mdAllAs = [ mdA ]
        # Python list for z vectors, layer-by-layer
        l_mdAllZs = []

        # For the weight and bias matrices of each layer... 
        for mdW, mdB in zip( self.m_amdAllWs, self.m_amdAllBs ):
            # Compute the net input vector to activate 
            # the neurons of this layer
            mdZ = np.dot( mdW, mdA ) + mdB
            l_mdAllZs.append( mdZ )
            
            # Compute the activations for all the neurons
            # of this layer 
            mdA = Sigmoid( mdZ )
            l_mdAllAs.append( mdA )

        return ( l_mdAllZs, l_mdAllAs )

    ############################################################################
    # Called with inputs and activations for all the layers, 
    # i.e., for the entire network in one go. 
    def BackProp( self, l_mdAllAs, l_mdAllZs, mdTarget ):

        # Allocate a local matrix each for holding the GradB and GradW matrices
        # for all the layers. Both are initialized to zero's. Can't use 
        # np.zeros() because these matrices are not of uniform sizes; 
        # the no. of neurons in each layer is arbitrarily different
        amdAllGradBs = [ np.zeros( mdB.shape ) for mdB in self.m_amdAllBs ]
        amdAllGradWs = [ np.zeros( mdW.shape ) for mdW in self.m_amdAllWs ]

        # Back-propagation occurs in two distinct stages:
        # (i) In the first stage, we begin from the end (the output layer), and
        # use the layer just before that, in order to update the biases and 
        # weights of the connections lying between the two (which, in this 
        # class, are stored with the OP layer). 
        # (ii) In the second stage, we iterate back successively through the 
        # network, starting from the last hidden layer, up to the first hidden 
        # layer.
        # The split-up into two stages is necessary because the activation of a
        # neuron from the input- or hidden-layers flows through many (all) 
        # neurons in the immediately next layer, and the total cost function 
        # (i.e. the total error) is affected by *all* these paths. In contrast, 
        # the output layer has no next layer, and so, the activation of a 
        # neuron in the output layer affects the total cost function only 
        # through its own connection to the total cost, not that of other 
        # layers. So, the Delta at the OP layer can be directly 
        # computed from a knowledge of the OP layer activations (predictions) 
        # and the prescribed targets. But to compute the Delta at an 
        # intermediate layer, we do need the Delta for the immediately 
        # next layer.  

        ##########
        # Stage I: Compute mdGradB and mdGradW for the Output Layer
        
        # The OP layer is the last, accessed through the -1 index. Get the 
        # activations (A's) on its output side, and the net inputs (Z's) on its 
        # input side.
        mdA_OP = l_mdAllAs[ -1 ] 
        mdZ_OP = l_mdAllZs[ -1 ] 
        
        # Compute the partial derivatives, and use them to find the Grad for 
        # all B's in the OP layer.  
        mdZ_OP_Der = SigmoidDerivative( mdZ_OP )
        # As an intermediate product, find the delta for the OP layer. It is
        # subsequently used in the second stage as an input, thereby 
        # back-propagating the changes to the total cost function back 
        # throughout the network. 
        mdDelta = CostDerivative( mdA_OP, mdTarget ) * mdZ_OP_Der
        # Compute the Grad for all B's in the OP layer
        amdAllGradBs[ -1 ] = mdDelta
        
        # Compute the partial derivatives, and find the Grad for all W's in 
        # the layer just before the OP layer. Index -2 means: just before OP.
        # We use "LHL" to mean the last hidden layer. 
        mdA_LHL_Transpose = l_mdAllAs[ -2 ].transpose()
        # Here, we build the GradW *matrix* from what essentially are two 
        # vectors (or (N,1) matrices). Even if the numpy function says
        # "dot", it actually works like the tensor product---it outputs an 
        # (N X N) matrix, not a single number as a scalar. It is for this step
        # that we made even B's an (N,1) shaped matrix in the constructor. 
        amdAllGradWs[ -1 ] = np.dot( mdDelta, mdA_LHL_Transpose )
        
        ###########
        # Stage II: Compute mdGradB and mdGradW for the last hidden layer
        #           and go back, all the way up to the first hidden layer 

        # Note that only the negative of the index ever gets used in the 
        # following loop. 
        # a[-2] means the second-last layer (i.e. the last hidden layer).
        # a[-1] means the last layer (i.e., the OP layer).
        # a[-2-1] means the layer just before the second-last layer.
        # The current layer starts from the output layer. The previous
        # layer is the one before the current. 
        # If the indexing gets confusing, notice that Nielsen uses the range 
        # (2, nLayers), rather than that of (1, nLayers-1) which would have 
        # been a bit more natural, at least to me. But of course, his choice 
        # is slightly more consistent from a mathematical viewpoint---and 
        # from the fact that there are no weights and biases for the IP 
        # layer.   
        for nL in range( 2, self.m_nLayers ):
            # Find the mdDelta for the previous layer, using that for the
            # current layer...  
            mdZCur = l_mdAllZs[ -nL ]
            mdZCurDer = SigmoidDerivative( mdZCur )
            mdWNextTrans = self.m_amdAllWs[ -nL+1 ].transpose()
            # Note, for the very first pass through the loop, the mdDelta being 
            # used on the RHS is what was earlier computed in the Stage I given 
            # above (viz. for the output layer). It now gets updated here, so 
            # that it can act as the "previous" mdDelta in the next pass 
            # through this loop.
            mdDelta = np.dot( mdWNextTrans, mdDelta ) * mdZCurDer
            # mdDelta now refers to the current layer, not to the next
            amdAllGradBs[ -nL ] = mdDelta
            
            mdAPrevTrans = l_mdAllAs[ -nL-1 ].transpose()
            amdAllGradWs[ -nL ] = np.dot( mdDelta, mdAPrevTrans )

            # Explaining the above four lines is left as an exercise 
            # for the reader. (Hints: (i) Write out the matrix equation 
            # "straight," i.e., as used in the feed-forward pass. (ii) 
            # Use the rules for the matrix multiplications and the rules
            # for taking transposes.)
               
        return ( amdAllGradBs, amdAllGradWs )

    ############################################################################
    # Implements the stochastic gradient descent algorithm
    def SGD( self, sErrorsFileName, nEpochs, 
             TrainingData, nMiniBatchSize, TestData ):
        
        # This variable is used only for the x-axis during plotting the 
        # total activation errors at the OP layer.
        adErrors = np.zeros( nEpochs ) 

        # To have a small file-size, we persist the total error for the 
        # output layer only as averaged over the entire epoch (i.e., the 
        # entire training data, randomly shuffled). 
        foErrors = open( sErrorsFileName, 'w' )
        
        nTrainingCases = len( TrainingData )
       
        # An epoch is defined for the entire training data. However, each epoch
        # begins by randomly shuffling the training data. Effectively, we
        # change the initial position from which to begin descending down the
        # higher-dimensional total-cost-function surface. 
        for e in range( nEpochs ):

            # Create Mini-Batches covering the entire training data
            np.random.shuffle( TrainingData )
            MiniBatches = [ TrainingData[ k : k + nMiniBatchSize ] 
                            for k in range( 0, nTrainingCases, nMiniBatchSize ) ]
            
            dTotalErrorEpoch = 0
            # For each mini-batch
            for MiniBatch in MiniBatches:
                # Conduct the training over the entire mini-batch, 
                # and collect the errors accumulated over it. Add them
                # to the accumulated error for the current epoch. 
                l_mdErrorsMB = self.UpdateMiniBatch( dEta, nMiniBatchSize, MiniBatch )
                dAvgErrorMB = self.ComputeAvgErrorForMiniBatch( l_mdErrorsMB )
                dTotalErrorEpoch = dTotalErrorEpoch + dAvgErrorMB 

            # Average over the number of mini-batches in this epoch
            dAvgErrorEpoch = dTotalErrorEpoch / len( MiniBatches )
            adErrors[ e ] = dAvgErrorEpoch
            # Write to file
            sLine = "%E\n" % (dAvgErrorEpoch)
            foErrors.write( sLine )

            # self.Evaluate( TestData )
            print( "Epoch %d: Avg. Error: %lf" % (e, dAvgErrorEpoch) )

        foErrors.close()
        
        adx = np.arange( len(adErrors) )
        plt.plot( adx, adErrors )
        plt.xlabel( "Epochs" )
        plt.ylabel( "Total Error at the OP Layer\n(Epoch Avg. of Mini-Batch Avgs.)" )
        plt.title( "Training Errors" )
        plt.show()

    ############################################################################
    # Helper function
    def ComputeAvgErrorForMiniBatch( self, l_mdErrorsMB ):
        nSize = len( l_mdErrorsMB )
        dTotalErrorMB = 0 
        # For each training case in this mini-batch
        for i in range( nSize ):
            # Get the error vector for the i-th training case
            mdE = l_mdErrorsMB[ i ]
            dTotalErrorCase = mdE.sum()
            dTotalErrorMB = dTotalErrorMB + dTotalErrorCase
        # The average of the total errors for all cases in the current mini-batch
        dAvgErrorMB = dTotalErrorMB / nSize
        return dAvgErrorMB

    ############################################################################
    # This function is mostly useless in this simple a scenario; it predicts
    # 100 % accuracy right from 1st epoch! Obvious. With the kind of input 
    # data we generate, the prediction would have been 100 % accurate even
    # with a single function call to softmax! 
    #   
    # def Evaluate( self, TestData ):
    #     nTestCases = len( TestData )
    #     anPredictedRes = []
    #     for x, y in TestData:
    #         l_mdAllZs, l_mdAllAs = self.FeedForward( x )
            
    #         mdActivOP = l_mdAllAs[ -1 ] 
    #         nPredIdx = np.argmax( mdActivOP )
    #         nTargetIdx = np.argmax( y )
    #         anPredictedRes.append( int( nPredIdx == nTargetIdx ) )
    #     print( anPredictedRes )
    #     pass
################################################################################
# Class Network ends here
################################################################################


################################################################################
# main script
################################################################################

# Prepare Input Data

# The learning rate
dEta = 0.5 

# Number of Neurons contained In each successive Layer
anNIL = [10,10,10]

# Size of the input layer
nNeursInInput = anNIL[ 0 ]
# Size of the target layer
nTargets = anNIL[ -1 ]
# Total number of cases to have for training
nTotalNumCasesForTraining = 10000
# Total number of cases to have for testing
nTotalNumCasesForTesting = nTotalNumCasesForTraining // 5

# Number of cases to generate for each target for training
nCasesPerTargetForTraining = nTotalNumCasesForTraining // nTargets
# Number of cases to generate for each target for testing
nCasesPerTargetForTesting = nTotalNumCasesForTesting // nTargets


# For repeatability, seed the RNG in NumPy. For "true" random-ness
# at each program invocation, comment out the next line. Notice that 
# seeding, if any, must occur before generating data.   
np.random.seed( 0 )

TrainingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTraining )
TestingData = GenerateData( anNIL[ 0 ], anNIL[ -1 ], nCasesPerTargetForTesting )

# Number of Epochs
nEpochs = 10
# Mini-Batch Size
nMiniBatchSize = 10

# Instantiate the network
# Optionally, we can have a re-seeding also here
# np.random.seed( 10000 )
nn = Network( anNIL )
# Let the network find the most optimum combination of values 
# for all the weights and biases it contains.
nn.SGD( "EpochErrors.txt", nEpochs, TrainingData, nMiniBatchSize, TestingData )

print( "Done!" )


Things for you to add and/or to try:

  • Add the code to persist the network (weights and biases) to a plain-text CSV file, and to re-init it from that file. (A minimal sketch appears right after this list.)
  • Add the code to show a simple (real-time) diagrammatic visualization of the training process for a tiny (say [4,3,2]) network, using colormap-based shades for the lines of the synaptic weights, and similarly, colormap-based circles for the neuron backgrounds for the biases.
  • Write a more ambitious code for the generation of n-dimensional input data. It would have `m’ islands of volumes (of the simplest topology) of randomly varying dark shades, embedded in a randomly varying background of light shades. That is, it would be an n-dimensional generalization of taking the Windows Paint (or similar) program, having a nonuniform but light-gray fill-color, with dark-shaded blobs of arbitrary shapes randomly placed in it, sort of like a two-phase material’s micro-structure. You have to take a configuration of such blobs and introduce random but small variations into its shape and position too. Then, take this small group of variations and assign a target value to it. Repeat for N number of targets.
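
For the first item in the list above, here is a minimal sketch of what the persistence code could look like. This is my own addition, not part of the re-typed code; the helper names SaveNetwork and LoadNetwork are hypothetical, and, as a slight variation on the item above, it writes one CSV file per layer rather than a single file. It assumes the Network class given above is kept as-is.

import numpy as np

# A minimal sketch (my addition; hypothetical helpers) for persisting the
# network to plain-text CSV files, and for re-initializing it from them.
def SaveNetwork( nn, sFileNamePrefix ):
    # One CSV file per layer, separately for the biases and for the weights
    for i, ( mdB, mdW ) in enumerate( zip( nn.m_amdAllBs, nn.m_amdAllWs ) ):
        np.savetxt( "%s_B_%d.csv" % ( sFileNamePrefix, i ), mdB, delimiter="," )
        np.savetxt( "%s_W_%d.csv" % ( sFileNamePrefix, i ), mdW, delimiter="," )

def LoadNetwork( nn, sFileNamePrefix ):
    # Overwrite the randomly initialized biases and weights with the saved ones
    for i in range( len( nn.m_amdAllBs ) ):
        mdB = np.loadtxt( "%s_B_%d.csv" % ( sFileNamePrefix, i ), delimiter="," )
        mdW = np.loadtxt( "%s_W_%d.csv" % ( sFileNamePrefix, i ), delimiter="," )
        # np.loadtxt() drops singleton dimensions; restore the original shapes
        nn.m_amdAllBs[ i ] = mdB.reshape( nn.m_amdAllBs[ i ].shape )
        nn.m_amdAllWs[ i ] = mdW.reshape( nn.m_amdAllWs[ i ].shape )

Usage would be, say, SaveNetwork( nn, "MyANN" ) after nn.SGD(...) has finished, and LoadNetwork( nn, "MyANN" ) on a freshly constructed Network( anNIL ) of the same architecture.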

 


No songs section this time. Will be back on 31st/1st, as promised (see the post just below).

A song I like:

[Realizing that you wouldn’t be in a mood to listen to (and be able to hum) a neat, melodious, golden oldie, may as well run it right away. … Yes, I will come back with my new year’s resolutions on the 31st/1st (see the post just below), but it’s likely that I won’t run any songs section at that time.]

(Hindi) “dhaDakne lage dil ke taron ki duniya…”
Singers: Mahendra Kapoor, Asha Bhosale
Music: N. Dutta
Lyrics: Sahir Ludhianvi

….Done. Take care and bye for now. See you on the 31st, with the new post…


History:
First published: 2018.12.27 23:29 IST
Updated (also the code): 2018.12.28 09.08 IST

 

Why are NYRs so hard to keep?

Why do people at all make all those New Year Resolutions (NYRs)? Any idea? And once having made them, why do they end up breaking them all so soon? Why do the NYRs turn out to be so hard to keep?

You have tried making some resolutions at least a few times in the past, haven’t you? So just think a bit about it before continuing reading further—think why they were so hard to keep. … Was it all an issue of a lack of sufficient will power? Or was something else at work here? Think about it…

My answer appears immediately below, so if you want to think a little about it, then really, stop reading right here, and come back and continue once you are ready to go further.


My answer:

People make resolutions because they want to get better, and also decide on doing something about it, like setting concrete goal-posts for it.

Further, I think that people fail to keep the resolutions because they make them only at the 11th hour.


A frequently offered counter-argument:

Now, you might object to the first part of my answer. “Who takes all that self-improvement nonsense so seriously anyway?” you might argue. “People make resolutions simply because it’s a very common thing to do on the New Year’s Eve. Everyone else is happy making them, and so, you are led into believing that may be, you too should have a shot at it. But really speaking, the whole thing is just a joke.”

Good attempt at finding the reasons! But not exactly a very acute analysis. Let me show you how, by tackling just this one aspect: making resolutions just because the other people are doing the same…


Following other people—what does that exactly mean?:

If someone goes on to repeat a certain thing just as everyone else is doing it, then, does this fact by itself make him a part of the herd? a fool? Really? Think about it.

Suppose you have been watching an absolutely thrilling sports match, say a one-day international cricket match. Suppose you have specially arranged for a day’s leave from your work, and you have gone with your friends to the stadium. Suppose that the team you have been rooting for wins the finals. Everyone in your group suddenly begins dancing, yells, blows horns, beats drums, and all that. Your group generally begins to have a wild celebration together. Seeing them do that, almost like within a fraction of a second, you join them, too.

Does your action mean you have been a mindless sheep following the others in your group? Does it mean that you derived no personal pleasure from the win of your team? That you yourself had no desire to express your joy, your exhilaration? Is your excitement predominantly dependent, on such an occasion, on what other people are doing? Or is it the case that the excitement and the joy are all authentically your own, and it’s just that their outer expression differs? For instance, you wouldn’t be able to go *so* wild if your boss were to be sitting in the next row, rooting for the other team! Maybe it’s just your outer expression which is shaped by looking at how other people celebrate on the occasion. The most you actually gather by observing others is how to express your joy—not that you have joy. (Observe how the Mexican wave works.) It’s not an instance of herd behaviour at all!

Something similar for the NYRs too. People make resolutions because there is some underlying cause, a personal reason, as to why they want to do that. And the reason is what I already said above. Namely, that they want to get better.

Of course, it’s not that you didn’t have any point in your argument above. The influence of the other people sure is always there. But it’s a minor, incidental, thing, occurring purely at the surface.


How people actually make their resolutions:

Coming back to the NYRs, it’s a fact that around the year-end, a great number of other people are busy with certain things: compiling all those top 10 lists (for the bygone year), buying or gifting diaries or calendars (for the new year), and, of course, making resolutions for the new year. Often, they “seriously” let you in on what resolutions they have decided on, too.

If so many people were not to get so enthusiastic about making these NYRs, it’s possible, nay, even probable, that you yourself wouldn’t have thought of doing the same thing on this occasion. Possible. So, in that sense, yes, you are getting influenced by what other people do.

Yet, when it is time to take the actual action, people invariably try to figure out what is personally important to them. Not to someone else. In making resolutions, people actually don’t think too much about society, come to think of it.

No one resolves something like, for instance, that he will take a 10,000 km one-way trip in the new year, and go help some completely random couple settle some issue between them like, you know, why he spends so much money on the gadgets, or why she spends so much time on getting ready—or how they should settle their divorce agreement. People typically aren’t very enthusiastic about setting such aims by way of New Year’s Resolutions, especially if they involve complete strangers. Even if it is true that a lot of people do resolve to undertake some humanitarian service, it’s more out of a feeling of having to combine something that is good, and something that is social—or altruistic. The first element (the desire to do something good, to bring about some “real change”) is the more dominant motivation there, most often. And even if it is true that there are just six degrees of separation between most of humanity, the fact of the matter still remains that while settling on their resolutions, most people usually don’t traverse even just one degree, let alone the remaining five (i.e., the entire society).

On the other hand, quitting drinking—or at least resolving to limit themselves to “just a couple of pegs, that’s all” is different. This one particular resolution appears very regularly near the top of people’s lists. There often seems to be this underlying sense that there is an area where they need to improve themselves. An awareness of that vague sense is then followed by a resolution, a “commitment, come what may,” sort of. To give it a good try all over once again, so to speak.


The paradox, and a bit about my recent take about it:

And yet, despite this matter being of such personal importance, people still often fail in keeping their resolutions. Think of the usual resolutions like “regular exercise,” or “not having any more than a 90 [ml of a hard-drink] on an evening,” or “maintaining records of all expenses on a daily basis, and balancing bank-books regularly…” These are some of the items that regularly appear on people’s lists. That’s the good part. The bad part is, the same items happen to appear on the lists of the same people year after year.

Now, coming to the reasons for such a mass-ive (I mean wide-spread) failure, I have already given you a hint above. People typically fail, I said, because they make those resolutions at the 11th hour. They make them on the spur of the moment, often thinking them up right on the night of the 31st itself.

OK, let me note an aside here. The issue, I think, is not, really speaking, one of just time. Hey, what are those new year’s diaries and planners for, except for using them at the beginning of the year? And people do use such aids for some time at the beginning. … So, yes, time-tables and all are involved, and people still fail to keep up.

So, the issue must be deeper than that, I thought. In any case, I have come to form one hypothesis about it.

Come to think of it, some time ago, I had jotted down my thoughts on this matter in a somewhat lighter vein. I had said: if you want to keep your resolutions, make only those which you can actually keep!

Coming back to the hypothesis which I now have, well, it is somewhat on similar lines, but in a bit more detailed, more “advanced” sort of a way. I am going to test it on myself first at the turn of this year, and see how good or poor it turns out to be (for whatever this idea is worth as a hypothesis, anyway).

As a part of my testing “strategy” I will also be announcing my NYRs on the 31st (or at the most the 1st) here. Stay tuned.


Oh yes, by way of a minor update: even though I was down for a few days with a minor fever and nausea, I have by now recovered well, and am already back to pursuing data science. … More, later.

… Oh yes, the crackers remind me. … Happy Christmas, once again…

Will be back on the 31st or 1st. Until then, take care, and bye for now…


A song I like:
(Hindi) “Yun hi chala chal rahi”
Singers: Kailash Kher, Hariharan, Udit Narayan
Music: A. R. Rahman
Lyrics: Javed Akhtar


[Guess no need to edit this post; it’s mostly come out as pretty OK right in the first pass; will leave it as is.]

A general update

Hmmm… Slightly more than 3 weeks since I posted anything here. A couple of things happened in the meanwhile.


1. Wrapping up of writing QM scripts:

First, I wrapped up my simulations of QM. I had reached a stage (just in my mind, neither on paper nor on laptop) whereby the next thing to implement would have been: the simplest simulations using my new approach. … Ummm… I am jumping ahead of myself.

OK, to go back a bit. The way things happened, I had just about begun pursuing Data Science when this QM thingie (conference) suddenly came up. So, I had to abandon Data Science as is, and turn my attention full-time to QM. I wrote the abstract, sent it to the conference, and started jotting down some of the early points for the eventual paper. Frequent consultations with text-books were a part of it, and so was searching for any relevant research papers. Then, I also began doing simulations of the simplest textbook cases, just to see if I could find any simpler route from the standard / mainstream QM to my re-telling of the facts covered by it.

Then, as things turned out, my abstract for the conference paper got rejected. However, now that I had gotten into a tempo of writing and running the simulations, I decided to complete at least those standard UG textbook cases before wrapping up this entire activity and going back to Data Science. My last post was written when I was in the middle of this activity.

While thus pursuing the standard cases of textbook QM (see my last post), I also browsed a lot, thought a lot, and eventually found that simulations involving my approach shouldn’t take as long as a year, not even several months (as I had mentioned in my last post). What happened here was that during the aforementioned activity, I ended up figuring out a far simpler way that should still illustrate certain key ideas from my new approach.

So, the situation, say in the first week of December, was the following: (i) Because the proposed paper had been rejected, there was no urgency for me to continue working on the QM front. (ii) I had anyway found a simpler way to simulate my new approach, and the revised estimates were that even while working part-time, I should be able to finish the whole thing (the simulations and the paper) over just a few months’ period, say next year. (iii) At the same time, studies of Data Science had anyway been kept on the back-burner.

That’s how (and why) I came to wrap up all my activity on the QM front, first thing.

I then took a little break. I then turned back to Data Science.


2. Back to Data Science:

As far as learning Data Science goes, I knew from my past experience that books bearing titles such as: “Learn Artificial Intelligence in 3 Days,” or “Mastering Machine Learning in 24 Hours,” if available, would have been very deeply satisfying, even gratifying.

However, to my dismay, I found that no such titles exist. … Or, maybe, such books are there, but someone at Google is deliberately suppressing the links to them. Whatever the case, forget becoming a Guru in 24 hours (or even in 3 days); I found that no one was promising me that I could master even just one ML library (say TensorFlow, or at least scikit-learn) over even a longer period, say about a week’s time or so.

Sure there were certain other books—you know, books which had blurbs and reader-reviews which were remarkably similar to what goes with those mastering-within-24-hours sort of books. However, these books had less appealing titles. I browsed through a few of these, and found that there simply was no way out; I would have to begin with Michael Nielsen’s book [^].

Which I did.

Come to think of it, the first time I had begun with Nielsen’s book was way back, in 2016. At that time, I had not gone beyond the first couple of sections of the first chapter or so. I certainly had not gotten around to even going through the first code snippet that Nielsen gives, let alone running it, or trying any variations on it.

This time around, though, I decided to stick it out with this book. I had to. … What was the end result?

Well, quite unlike my usual self, I didn’t take any jumps while going through this particular book. I began reading it in the given sequence, and then found that I could even continue with the same (i.e., reading in sequence)! I also made some furious underlines, margin-notes, end-notes, and all that. (That’s right. I was not reading this book online; I had first taken a printout.) I also sketched a few data structures in the margins, notably for the code around the “w” matrices. (I tend to suspect everyone else’s data structures except for mine!) I pursued this activity covering just about everything in the book, except for the last chapter. It was at this point that my patience finally broke down. I went back to my usual self and began jumping back and forth over the topics.

As a result, I can’t say that I have finished the book. But yes, I think I’ve got a fairly good idea of what’s there in it.

So there.


3. What books to read after Nielsen’s?

Of course, Nielsen’s book wasn’t the only thing that I pursued over the past couple of weeks. I also very rapidly browsed through some other books, checked out the tutorial sites on libraries like scikit-learn, TensorFlow, etc. I came to figure out two things:

As the first thing, I found that I was unnecessarily getting tense when I saw young people casually toss around some fearsome words like “recurrent learning,” “convolutional networks,” “sentiment analysis,” etc., all with such ease and confidence. Not just on the ‘net but also in real life. … I came to see them do that when I attended a function for the final-rounds presentations at Intel’s national-level competition (which was held in IISER Pune, a couple of months ago or so). Since I had seen those quoted words (like “recurrent learning”) only while browsing through text-books or Wiki articles, I had actually come to feel a bit nervous at that event. Ditto, when I went through the Quora answers. Young people everywhere in the world seemed to have put in a lot of hard work in studying Data Science. “When am I going to catch up with them, if ever?” I had thought.

It was only now, after going through the documentation and tutorials for these code libraries (like scikit-learn), that I came to realize that the most likely scenario here was that most of these kids were simply talking after trying out a few ready-made tutorials or so. … Why, one of the prize-winning (or at least, short-listed) presentations at that Intel competition was about the particle-swarm optimization, and during their talk, the students had even shown a neat visualization of how this algorithm works when there are many local minima. I had been quite impressed by that presentation. … Now I gathered that it was just a ready-made animated GIF lifted from KDNuggets or some other, similar, site… (Well, as it turns out, it must have been from the Wiki! [^])
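
As an aside, for readers who (like me back then) have seen the term only in passing: here is a minimal, generic sketch of the particle-swarm optimization idea, written purely as my own illustration (it has nothing to do with that presentation or with the Wiki animation), run on a test function with many local minima:

import numpy as np

# A minimal, generic particle-swarm optimization (PSO) sketch -- my own
# illustration only, unrelated to the presentation mentioned above.
def PSO( funcCost, adLow, adHigh, nParticles=30, nIters=200,
         dInertia=0.7, dCognitive=1.5, dSocial=1.5 ):
    np.random.seed( 0 )
    adX = np.random.uniform( adLow, adHigh, ( nParticles, len( adLow ) ) )  # positions
    adV = np.zeros_like( adX )                                              # velocities
    adPBest = adX.copy()                                                    # personal bests
    adPBestCost = np.array( [ funcCost( p ) for p in adX ] )
    adGBest = adPBest[ np.argmin( adPBestCost ) ].copy()                    # swarm's best
    for _ in range( nIters ):
        r1 = np.random.rand( *adX.shape )
        r2 = np.random.rand( *adX.shape )
        # Each particle is pulled towards its own best and towards the swarm's best
        adV = dInertia * adV + dCognitive * r1 * ( adPBest - adX ) \
              + dSocial * r2 * ( adGBest - adX )
        adX = np.clip( adX + adV, adLow, adHigh )
        adCost = np.array( [ funcCost( p ) for p in adX ] )
        abImproved = adCost < adPBestCost
        adPBest[ abImproved ] = adX[ abImproved ]
        adPBestCost[ abImproved ] = adCost[ abImproved ]
        adGBest = adPBest[ np.argmin( adPBestCost ) ].copy()
    return adGBest, adPBestCost.min()

# The Rastrigin function has many local minima; its global minimum is at the origin
def Rastrigin( adP ):
    return 10.0 * len( adP ) + np.sum( adP * adP - 10.0 * np.cos( 2.0 * np.pi * adP ) )

print( PSO( Rastrigin, np.array( [-5.12, -5.12] ), np.array( [5.12, 5.12] ) ) )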

As the second thing, I realized that for those topics which Nielsen doesn’t cover, good introductory books are hard to find. (That was a bit of an understatement. My real feeling here is that we are lucky that Nielsen’s book is available at all in the first place!)

…If you have any tips on a good book after Nielsen’s then please drop me an email or a comment; thanks in advance.


4. A tentative plan:

Anyway, as of now, a good plan seems to be: (i) first, to complete the first pass through Nielsen’s book (which should take just about a couple of days or so), and then, to begin pursuing all of the following, more or less completely simultaneously: (ii) locating and going through the best introductory books / tutorials on other topics in ML (like PCA, k-means, etc.); (iii) running tutorials of ML libraries (like scikit-learn and TensorFlow); (iv) typing out LaTeX notes for Nielsen’s book (which would be useful eventually for such things as hyper-parameter tuning), and running modified (i.e., simplified) versions of his code (which means, the second pass through his book); and finally (v) to begin cultivating some pet project from Data Science for moonlighting over a long period of time (just the way I have maintained a long-running interest in the micro-level water-resources engineering).

As to the topic for the pet project, here are the contenders as of today. I have not finalized anything just as yet (and am likely not to do so for quite some time), but the following seem to be attractive: (a) Predicting rainfall in India (though getting granular enough data is going to be a challenge), (b) Predicting earth-quakes (locations and/or intensities), (c) Identifying the Indian classical “raaga” of popular songs, etc. … I also have some other ideas but these are more in the nature of professional interests (especially, for application in engineering industries). … Once again, if you feel there is some neat idea that could be adopted for the pet project, then sure point it out to me. …


…Anyway, that’s about it! Time to sign off. Will come back next year—or if some code / notes get written before that, then even earlier, but no definite promises.

So, until then, happy Christmas, and happy new year!…


A song I like:

(Marathi) “mee maaze mohita…”
Lyrics: Sant Dnyaaneshwar
Music and Singer: Kishori Amonkar


[One editing pass is still due; should be effected within a day or two. Done on 2018.12.18 13:41 hrs IST.]