A Special note for the Potential Employers from the Data Science field:
Recently, in April 2020, I achieved a World Rank # 5 on the MNIST problem. The initial announcement can be found here [^], and a further status update, here [^].
All my data science-related posts can always be found here [^].
1. The first cut for the H atom in a 3D box:
The last time [^], I spoke of an enjoyable activity, namely, how to make the tea (and also how to have it).
Talking of other, equally enjoyable things, I have completed the Python code for simulating the H atom in a box.
In the first cut for the 3D code (as also in the previous code in this series [^]), I used NumPy’s dense matrices and Python “for” loops. Running this preliminary code, I obtained the following colourful diagrams, and tweeted them:
H atom in a 3D box of 1 angstrom sides. Ground state (i.e., 1s, eigenvalue index = 0). All contours taken together show a single stationary state. Contour surfaces plotted with wiremesh. Plotted with Mayavi’s mlab.contour3d().
H atom in a 3D box of 1 angstrom sides. A ‘p’ state (eigenvalue index = 2). All contours taken together show a single stationary state. Contour surfaces with the Gouraud interpolation. Plotted with Mayavi’s mlab.contour3d().
H atom in a 3D box of 1 angstrom sides. A ‘p’ state (eigenvalue index = 2). All contours taken together show a single stationary state. Contour surfaces with wiremesh. Plotted with Mayavi’s mlab.contour3d().
H atom in a 3D box of 1 angstrom sides. Another ‘p’ state (eigenvalue index = 3). All contours taken together show a single stationary state. Contour surfaces with the Gouraud interpolation. Plotted with Mayavi’s mlab.contour3d().
OK, as far as many (most?) of you are concerned, the enjoyable part of this post is over. So, go read something else on the ‘net.
Coming back to my enjoyment…
2. Sparse matrices. Vectorization of code:
After getting to the above plots with dense matrices and Python “for” loops, I then completely rewrote the whole code using SciPy’s sparse matrices, and put vectorized code in place of the Python “for” loops. (As a matter of fact, in the process of rewriting, I seem to have deleted the plots too. So, today, I took the above plots from my Twitter account!)
2.1 My opinions about vectorizing code, other programmers, interviews, etc:
Vectorization was not really necessary in this problem (an eigenvalue problem), because even if you incrementally build the FD-discretized Hamiltonian matrix, it takes less than 1 percent of the total execution time. 99 % of the execution time is spent in the SciPy library calls.
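To see how the build time compares with the solve time, here is a minimal sketch, not the code discussed in this post. It assembles only the kinetic part (a dense 7-point finite-difference Laplacian with Dirichlet walls) using plain “for” loops, and then times a dense eigenvalue solve. The mesh size, the unit box side, and the helper name build_laplacian_loops are assumptions made for illustration. The measured share depends on the machine and on the mesh.

```python
import time
import numpy as np


def build_laplacian_loops(n, h):
    """Dense 7-point FD Laplacian on an n*n*n interior grid (Dirichlet walls)."""
    size = n ** 3
    lap = np.zeros((size, size))

    def flat(i, j, k):
        # Map a 3D node index to a row/column number (C ordering).
        return (i * n + j) * n + k

    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                row = flat(i, j, k)
                lap[row, row] = -6.0 / h**2
                for di, dj, dk in steps:
                    ii, jj, kk = i + di, j + dj, k + dk
                    # Neighbours outside the grid lie on the wall (psi = 0).
                    if 0 <= ii < n and 0 <= jj < n and 0 <= kk < n:
                        lap[row, flat(ii, jj, kk)] = 1.0 / h**2
    return lap


n = 8                # interior nodes per axis (hypothetical choice)
h = 1.0 / (n + 1)    # unit box side, arbitrary length unit

t0 = time.perf_counter()
lap = build_laplacian_loops(n, h)
t1 = time.perf_counter()
energies = np.linalg.eigvalsh(-0.5 * lap)   # kinetic energy operator only
t2 = time.perf_counter()

build_share = (t1 - t0) / (t2 - t0)
print(f"build: {t1 - t0:.4f} s, solve: {t2 - t1:.4f} s, build share: {build_share:.1%}")
```

The lowest eigenvalue can be checked against the closed-form eigenvalue of the discrete Laplacian, which is a cheap sanity test before adding the Coulomb term.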
Python programmers have a habit of always looking down on the simple “for” loops—and hence, on any one who writes them. So, I decided to write this special note.
The first thing you do about vectorization is not to begin wondering how best to implement it for a particular problem. The first thing you do is to ask yourself: Is vectorization really necessary here? Ditto, for lambda expressions. Ditto, for list comprehension. Ditto for itertools. Ditto for almost anything that is a favourite of the dumb interviewers (which means, most Indian Python programmers).
Vectorized codes might earn you brownie points among the Python programmers (including those who interview you for jobs). But such codes are more prone to bugs, harder to debug, and definitely much harder to understand even by you after a gap of time. Why?
That’s because practically speaking, while writing in Python, you hardly if ever define C-struct like things. Python does have classes. But these are rather primitive classes. You are not expected to write code around classes, and then put objects into containers. Technically, you can do that, but it’s not at all efficient. So, practically speaking, you are almost always into using the NumPy ndarrays, or similar things (like Pandas, xarrays, dasks, etc.).
Now, once you have these array-like thingies, indexing becomes important. Why? Because, in Python, it is the design of a number of arrays, and the relationships among their indexing scheme which together “defines” the operative data structures. Python, for all its glory, has this un-removable flaw: The design of the data structures is always implicit; never directly visible through a language construct.
So, in Python, it’s the indexing scheme which plays the same part as the classes, inheritance, genericity play in C++. But it’s implicit. So, how you implement the indexing features becomes of paramount importance.
And here, in my informed opinion, the Python syntax for the slicing and indexing operations has been made unnecessarily intricate. I, for one, could easily design an equally powerful semantics that comes with a syntax that’s much easier on the eye.
In case some professionally employed data scientist (especially a young Indian one) takes offence at my above claim: Yes, I do mean what I say above. And, I also know what I am talking about.
Though I no longer put it on my CV, once, in the late 1990s, I had implemented a Yacc-like tool to output a table-driven parser for the LALR(1) languages (like Java and C++). It would take a language specification in the EBNF (Extended Backus-Naur Form) as the input file, and produce the tables for table-driven parsing of that language. I had implemented this thing completely on my own, looking just at the Dragon Book (Aho, Sethi, Ullman). I haven’t had a CS university education. So, I taught myself compiler theory, and then, began straight implementing it.
I looked at no previous code. And even if I were to look at something, it would have been horrible. These were ancient projects, written in C, not in C++, and written using arrays, no STL containers like “map”s. A lot of hard-coding, pre-proc macros, and all that. Eventually, I did take a look at the others’ code, but it was only in the verification stage. How did my code fare? Well, I didn’t have to change anything critical.
I had taken about 8 months for this exercise (done part time, on evenings, as a hobby). The closest effort was by some US mountain-time university group (consisting of a professor, one or two post-docs, and four-five graduate students). They had taken some 4 years to reach roughly the same place. To be fair, their code had many more features. But yes, both their code and mine addressed those languages which belonged to the same class of grammar specification, and hence, both would have had the same parsing complexity.
I mention all of this here mainly in order to “assure” the Indian / American programmers (you know, those BE/BS CS guys, the ones who are right now itching to fail me in any interview, should their HR arrange one for me in the first place) that I do know a bit about what I am saying when I distinguish between the actual computing operations on the one hand, and the mere syntax for those operations on the other. There are a lot of highly paid Indian IT professionals who never do learn this difference (but take care to point out that your degree isn’t from the CS field).
So, my conclusion is that despite all its greatness (and I do actually love Python), its syntax does have some serious weaknesses. Not just idiosyncrasies (which are fine) but actual weaknesses. The syntax for slicing and indexing is a prominent part of it.
Anyway, coming back to my present code (for the H atom in the 3D box, using finite difference method), if the execution time was so short, and if vectorization makes a code prone to bugs (and difficult to maintain), why did I bother implementing it?
- I wanted to have a compact-looking code. I was writing this code mainly for myself, so maintenance wasn’t an issue.
- In case some programmer / manager interviewing me began acting over-smart, I wanted to have something which I could throw at his face. (Recently, I ran into a woman who easily let out: “What’s there in a PoC (proof of concept, in case you don’t know)? Any one can do a PoC…” She ranted on a bit, but it was obvious that though she has been a senior manager and all, and lists managing innovations and all, she doesn’t know. There are a lot of people like her in the Indian IT industry. People who act over-smart.) An already implemented vectorized code, especially one they find difficult to read, would be a nice projectile to have handy.
2.2 Sparse SciPy matrices:
Coming back to the present code for the H atom: As I was saying, though vectorization was not necessary, I have anyway implemented the vectorization part.
I also started using sparse matrices.
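In case you don’t know: the sparse matrix types live in SciPy (scipy.sparse); NumPy itself has none. Where the two libraries do offer look-alike calls (as in numpy.linalg and scipy.linalg), they go through different underlying implementations.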
From what I have gathered, it seems safe to conclude this much: As a general rule, if doing some serious work, use SciPy’s calls, not NumPy’s. (But note, I am still learning this part.)
With sparse matrices, I can now easily go to a much finer mesh over the domain. I haven’t empirically tested the upper limit on my laptop, though an even bigger mesh should be easily possible. In contrast, earlier, with dense matrices, I was stuck with a far coarser mesh at most. The execution time too reduced drastically.
In my code, I have used only the dok_matrix() to build the sparse matrices, and only the tocsr(), and tocoo() calls for faster matrix computations in the SciPy eigenvalue calls. These are the only functions I’ve used—I haven’t tried all the pathways that SciPy opens up. However, I think that I have a pretty fast running code; that the execution time wouldn’t improve to any significant degree by using some other combination of calls.
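The pathway named above (dok_matrix() for assembly, tocsr() before the eigenvalue call) can be sketched as follows. This is an illustration under stated assumptions, not the code discussed in this post: atomic units, a nucleus fixed at the box centre, an even number of interior nodes per axis so that no node sits on the Coulomb singularity, and scipy.sparse.linalg.eigsh as the solver. The helper name hydrogen_in_box and the mesh size are hypothetical.

```python
import numpy as np
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import eigsh

BOHR_PER_ANGSTROM = 1.8897261  # assumption: work in atomic units (hartree, bohr)


def hydrogen_in_box(n, side):
    """Sparse FD Hamiltonian, H = -(1/2) Laplacian - 1/r, nucleus at the centre."""
    if n % 2:
        raise ValueError("use an even n so that no node coincides with the nucleus")
    h = side / (n + 1)
    coords = np.arange(1, n + 1) * h - side / 2.0   # interior nodes only
    size = n ** 3
    ham = dok_matrix((size, size), dtype=np.float64)

    def flat(i, j, k):
        return (i * n + j) * n + k

    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                row = flat(i, j, k)
                r = np.sqrt(coords[i]**2 + coords[j]**2 + coords[k]**2)
                # Kinetic diagonal (3 / h^2) plus the Coulomb potential (-1 / r).
                ham[row, row] = 3.0 / h**2 - 1.0 / r
                for di, dj, dk in steps:
                    ii, jj, kk = i + di, j + dj, k + dk
                    if 0 <= ii < n and 0 <= jj < n and 0 <= kk < n:
                        ham[row, flat(ii, jj, kk)] = -0.5 / h**2
    return ham.tocsr()   # CSR is the faster format for the solver's mat-vec products


n = 10                                   # hypothetical mesh: 10 x 10 x 10 interior nodes
H = hydrogen_in_box(n, side=1.0 * BOHR_PER_ANGSTROM)
energies, states = eigsh(H, k=4, which="SA")   # four algebraically smallest eigenvalues
order = np.argsort(energies)
energies, states = energies[order], states[:, order]
print("lowest eigenvalues (hartree):", energies)
```

DOK is convenient for element-by-element assembly, while CSR suits repeated matrix-vector products; that is why the conversion happens once, right after the build.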
2.3 A notable curiosity:
I also tried, and succeeded to a great degree, in having an exactly identical code for all dimensions: 1D, 2D, 3D, and then even, in principle, ND. That is to say, no “if–else” statements that lead to different execution paths depending on the dimensionality.
If you understand what I just stated, then you sure would want to have a look at my code, because nothing similar exists anywhere on the ‘net (i.e., within the first 10 pages thrown up by Google during several differently phrased searches covering many different domains).
However, eventually, I abandoned this approach, because it made things too complicated, especially while dealing with computing the Coulomb fields. The part dealing with the discretized Laplacian was, in contrast, easier to implement, and it did begin working fully well, which was when I decided to abandon this entire approach. In case you know a bit about this territory: I had to liberally use numpy.newaxis.
Eventually, I came to abandon this insistence on having only a single set of code lines regardless of the dimensionality, because my programmer’s/engineer’s instincts cried against it. (Remember I don’t even like the slicing syntax of Python?) And so, I scrapped it. (But yes, I do have a copy, just in case someone wants to have a look.)
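For readers wondering what branch-free, dimension-agnostic code can look like, here is one possible sketch. It is not the abandoned code mentioned above. It swaps in a Kronecker-sum construction for the Laplacian (a technique this post does not mention) and uses numpy.newaxis to broadcast a 1D coordinate array into an ND distance field, as would be needed for a Coulomb term. The function names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity, kron


def laplacian_1d(n, h):
    """Second-difference matrix on n interior nodes with Dirichlet ends."""
    return diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2


def laplacian_nd(n, h, ndim):
    """Kronecker sum of 1D Laplacians; the same code path for any ndim >= 1."""
    lap1 = laplacian_1d(n, h)
    eye = identity(n)
    total = csr_matrix((n**ndim, n**ndim))
    for axis in range(ndim):
        term = None
        for d in range(ndim):
            factor = lap1 if d == axis else eye
            term = factor if term is None else kron(term, factor)
        total = total + term
    return total.tocsr()


def radial_distance_nd(coords, ndim):
    """Distance from the origin on an ND grid, built by broadcasting."""
    r2 = 0.0
    for axis in range(ndim):
        # Put the coordinate array along one axis; newaxis everywhere else.
        index = [np.newaxis] * ndim
        index[axis] = slice(None)
        r2 = r2 + coords[tuple(index)] ** 2
    return np.sqrt(r2)


lap3 = laplacian_nd(n=4, h=0.5, ndim=3)
dist3 = radial_distance_nd(np.array([-1.0, 1.0]), ndim=3)
print(lap3.shape, dist3.shape)
```

The only place the dimensionality enters is the loop bound, which is the price paid in readability: the index tuples built from newaxis are exactly the kind of slicing gymnastics complained about above.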
2.4 When to use the “for” loops and when to use slicing + vectorization: A good example:
I always try to lift code if a suitable one is available ready made. So, I did a lot of searching for Python/MATLAB code for such things.
As far as the FD implementations of the Laplacian go, IMO, the best piece of Python code I saw (for this kind of a project) was that by Prof. Christian Hill [^]. His code is available for free from the site for a book he wrote; see here [^] for an example involving the finite difference discretization of the Laplacian.
Yes, Prof. Hill has wisely chosen to use only the Python “for” loops when it comes to specifying the IC. Thus, he reserves the vectorization only for the time-stepping part of the code.
Of course, unlike Prof. Hill’s code (transient diffusion), my code involves only eigenvalue computations—no time-stepping. So, one would be even more amply justified in using only the “for” loops for building the Laplacian matrix. Yet, as I noted, I vectorized everything in my code, merely because I felt like doing so. It’s during vectorization that the problem of differing dimensionality came up, which I solved, and then abandoned.
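The division of labour described above (loops for the one-off initial condition, slicing for the repeated time step) can be sketched as below. This is written from scratch for illustration and is not Prof. Hill’s code; the grid size, the diffusivity, the temperatures, and the disc-shaped initial condition are all assumed values.

```python
import numpy as np


def initial_condition(nx, ny, dx, dy, radius, u_cold, u_hot):
    """Hot disc in a cold plate, set with plain loops (runs only once)."""
    u = np.full((nx, ny), float(u_cold))
    cx, cy = 0.5 * (nx - 1) * dx, 0.5 * (ny - 1) * dy
    for i in range(nx):
        for j in range(ny):
            if (i * dx - cx) ** 2 + (j * dy - cy) ** 2 < radius ** 2:
                u[i, j] = u_hot
    return u


def do_timestep(u, diffusivity, dt, dx, dy):
    """One explicit (forward Euler) step, vectorized with slices."""
    new = u.copy()   # boundary values are carried over unchanged
    new[1:-1, 1:-1] = u[1:-1, 1:-1] + diffusivity * dt * (
        (u[2:, 1:-1] - 2.0 * u[1:-1, 1:-1] + u[:-2, 1:-1]) / dx**2
        + (u[1:-1, 2:] - 2.0 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dy**2
    )
    return new


nx = ny = 21
dx = dy = 0.1
diffusivity = 1.0                      # hypothetical value
# Half of the stability limit for the explicit scheme.
dt = 0.5 * dx**2 * dy**2 / (2.0 * diffusivity * (dx**2 + dy**2))

u0 = initial_condition(nx, ny, dx, dy, radius=0.5, u_cold=300.0, u_hot=700.0)
u = u0
for _ in range(50):
    u = do_timestep(u, diffusivity, dt, dx, dy)
print(u.min(), u.max())
```

The loop runs once over a few hundred nodes, so clarity wins there; the slice-based step runs at every time level, so that is where vectorization pays.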
2.5 Use of indexing matrices:
While writing my code, I figured out that a simple trick with using index matrices and arrays makes the vectorization part even more compact (and less susceptible to bugs). So, I implemented this approach—indexing matrices and arrays.
“Well, this is a very well known approach. What’s new?” you might ask. The new part is the use of matrices for indexing, not arrays. Very well known, sure. But very few people use it anyway.
Again, I was cautious. I wrote the code, saw it a couple of days later again, and made sure that using indices really made the code easier to understand—to me, of course. Only then I decided to retain it.
By using the indexing matrices, the code indeed becomes very clean-looking. It certainly looks far better (i.e. easier to grasp structure) than the first lines of code in Prof. Hill’s “do_timestep” function [^].
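One possible reading of the index-matrix idea is sketched below. It is a guess at the general technique, not the code discussed in this post: an integer array holding the row number of every grid node is sliced along each axis, and each pair of shifted slices directly yields the (row, column) pairs of neighbouring nodes. The function name and the use of coo_matrix for assembly are assumptions.

```python
import numpy as np
from scipy.sparse import coo_matrix


def laplacian_from_index_matrix(n, h):
    """7-point FD Laplacian (Dirichlet walls) assembled without node loops."""
    size = n ** 3
    idx = np.arange(size).reshape(n, n, n)   # idx[i, j, k] = matrix row of that node

    rows = [idx.ravel()]
    cols = [idx.ravel()]
    vals = [np.full(size, -6.0 / h**2)]      # diagonal entries

    for axis in range(3):
        # Nodes paired with their next neighbour along this axis.
        lower = np.take(idx, np.arange(0, n - 1), axis=axis).ravel()
        upper = np.take(idx, np.arange(1, n), axis=axis).ravel()
        off = np.full(lower.size, 1.0 / h**2)
        rows += [lower, upper]               # both (lower, upper) and (upper, lower)
        cols += [upper, lower]
        vals += [off, off]

    data = np.concatenate(vals)
    ij = (np.concatenate(rows), np.concatenate(cols))
    return coo_matrix((data, ij), shape=(size, size)).tocsr()


lap = laplacian_from_index_matrix(n=4, h=0.2)
print(lap.shape, lap.nnz)
```

Because the node numbering lives in one array, the boundary handling disappears: slices simply stop at the edge, so no neighbour outside the grid is ever generated.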
2.6 No code-drop:
During my numerous (if not exhaustive) searches, I found that no one posts a 3D code for quantum simulation that also uses finite differences (i.e. the simplest numerical technique).
Note, people do post codes for 3D, but these are only for more complicated approaches like: FDTD (finite difference time domain), FEM, (pseudo)spectral methods, etc. People also post code for FDM, when the domain is 1D. But none posts a code that is both FD and 2D/3D. People only post the maths for such a case. Some rare times, they also post the results of the simulations. But they don’t post the 3D FDM code. I don’t know the reason for this.
Maybe there is some money to be made if you keep some such tricks all to yourself?
Once this idea occurred to me, it was impossible for me not to take it seriously. … You know that I have been going jobless for two years by now. And, further, I did have to invest a definite amount of time and effort in getting those indexing matrices working right so that the vectorization part becomes intuitive.
So, I too have decided not to post my 3D code anywhere on the ‘net for free. Not immediately anyway. Let me think about it for a while before I go and post my code.
3. Covid in India:
The process of unlocking down has begun in India. However, the numbers simply aren’t right for any one to get relaxed (except for the entertainment sections of the Indian media like the Times of India, Yahoo!, etc.).
In India, we are nowhere near turning the corner. The data about India are such that even the time when the flattening might occur, is not just hard to predict, but with the current data, it is impossible.
Yes, I said impossible. I could put forward reasoning grounded in sound logic and good mathematics (e.g., things like Shannon’s theorem, von Neumann’s error analysis, etc.), if you want. But I think to any one who really knows a fair amount of maths, it’s not necessary. I think they will understand my point.
Let me repeat: The data about India are such that even the time when the flattening might occur, is not just hard to predict, but with the current data, it is impossible.
India’s data show a certain unique kind of a challenge for the data scientist—and it definitely calls for some serious apprehension by every one concerned. The data themselves are such that predictions have to be made very carefully.
If any one is telling you that India will cross (or has already crossed), say, more than 20 lakh cases, then know that he/she is not speaking from the data, the population size, the social structures, the particular diffusive dynamics of this country, etc. He/she is talking purely from imagination—or very poor maths.
Ditto, if someone tells you that there are going to be so many cases in this city or that, by this date or that, if the date runs into, say, August.
Given the actual data, in India, projections about number of cases in the future are likely to remain very tentative (having very big error bands).
Of course, you may still make some predictions, like those based on the doubling rate. You would be even justified in using this measure, but only for a very short time-span into the future. The reason is that India’s data carry these two peculiarities:
1. The growth rate has been, on a large enough scale, quite steady for a relatively long period of time. In India, there has been no exponential growth with a very large log-factor, not even initially (which I attribute to an early enough lock-down). There also has been no flattening (for whatever reasons, but see the next one).
2. The number of cases per million population still remains small.
Because of 1., the doubling rate can serve as a good short-term estimator when it comes to activities like large-scale resource planning (but it would be valid only for the short term). You will have to continuously monitor the data, and be willing to adjust your plans. Yet, the fact is also that the doubling rate has remained steady long enough that it can certainly be used for short-term planning (including by corporates).
However, because of 2., everyone will have to revise their estimates starting from the third week of June, when the effects of the un-locking down begin to become visible (not just in the hospitals or the quarantine centres, but also in terms of aggregated numbers).
Finally, realize that 1. matters only to the policy-makers (whether in government or in corporate sectors).
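For the short-term use of the doubling rate mentioned above, the arithmetic is simple enough to sketch. The counts below are made-up round numbers chosen only to show the calculation; they are not India’s data, and the function names are hypothetical. The projection assumes the growth rate stays constant, which is exactly the assumption that holds only for a short span.

```python
import math


def doubling_time(count_start, count_end, days):
    """Doubling time in days, assuming steady exponential growth over the span."""
    growth_per_day = math.log(count_end / count_start) / days
    return math.log(2.0) / growth_per_day


def project(count_now, doubling_days, days_ahead):
    """Short-term projection under an unchanged doubling time."""
    return count_now * 2.0 ** (days_ahead / doubling_days)


# Hypothetical cumulative counts, one week apart (illustration only).
t_double = doubling_time(count_start=100_000, count_end=141_421, days=7)
week_ahead = project(count_now=141_421, doubling_days=t_double, days_ahead=7)
print(f"doubling time: {t_double:.1f} days, one-week projection: {week_ahead:,.0f}")
```

Re-estimating the doubling time every few days, and discarding the old projection each time, is the “continuously monitor the data” step in practice.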
What matters to the general public at large is this one single question: Have we turned around the corner already? If not, when will we do that?
The short answers are: “No” and “Can’t Tell As of Today.”
In India’s case the data themselves are such that no data scientist worth his salt would be able to predict the time of flattening with any good accuracy—as of today. Nothing clear has emerged, even after 2.5 months, in the data. Since this sentence is very likely to be misinterpreted, let me explain.
I am not underestimating the efforts of the Indian doctors, nurses, support staff, police, and even other government agencies. If they were not to be in this fight, the data would’ve been far simpler to analyse—and far more deadly.
Given India’s population size, its poverty, its meagre medical resources, the absence of civic discipline, the illiteracy (which makes using symbols for political parties indispensable at the time of elections)… Given all such factors, the very fact that India’s data even today (after 2.5 months) still manages to remain hard to analyse suggests, to my mind, this conclusion:
There has been a very hard tussle going on between man and the virus so that no definitive trend could emerge either way.
There weren’t enough resources so that flattening could occur by now. If you kept that expectation to begin with, you were ignoring reality.
However, in India, the fight has been such that it must have been very tough on the virus too—else, the exponential function is too bad for us, and it is too easy for the virus.
The inability to project the date by which the flattening might be reached, must be seen in such a light.
The picture will become much clearer starting from two weeks in the future, because it would then begin reflecting the actual effects that the unlocking is producing right now.
So, if you are in India, take care even if the government has now allowed you to step out, go to office, and all that. But remember, you have to take even more care than you did during the lock-down, at least for the next one month or so, until the time that even if faint, some definitely discernible trends do begin to emerge, objectively speaking.
I sincerely hope that every one takes precautions so that we begin to see even just an approach towards the flattening. Realize that the number of cases and the number of deaths increase until the flattening occurs. So, take extra care, now that the diffusivity of people has increased.
A song I like:
(Western, instrumental): Mozart, Piano concerto 21, k. 467, second movement (andante in F major).
Listen, e.g., at this [^] YouTube video.
[ I am not too much into Western classical, though I have listened to a fair deal of it. I would spend hours in UAB’s excellent music library listening to all sorts of songs, though mostly Western classical. I would also sometimes make on-the-fly requests to the classical music channel of UAB’s radio station (or was it a local radio station? I no longer remember). I didn’t always like what I listened to, but I continued listening a lot anyway.
Then, as I grew older, I began discovering that, as far as the Western classical music goes, very often, I actually don’t much appreciate even some pieces that are otherwise very highly regarded by others. Even with a great like Mozart, there often are places where I can’t continue to remain in the flow of the music. Unknowingly, I come out of the music, and begin wondering: Here, in this piece, was the composer overtaken by a concern to show off his technical virtuosity rather than being absorbed in the music? He does seem to have a very neat tune somewhere in the neighbourhood of what he is doing here. Why doesn’t he stop tinkling the piano or stretching the violin, stop, think, and resume? I mean, he was composing music, not just blogging, wasn’t he?
The greater the composer or the tune suggested by the piece, the greater is this kind of a disappointment on my part.
Then, at other times, these Western classical folks do the equivalent of analysis-paralysis. They get stuck into the same thing for seemingly forever. If composing music is difficult, composing good music in the Western classical style is, IMHO, exponentially more difficult. That’s the reason why despite showing a definite “cultured-ness,” purely numbers-wise, most Western classical music tends to be boring. … Most Indian classical music also tends to be very boring. But I will cover it on some other day. Actually, one day won’t be enough. But still, this post is already too big…
Coming to the Western classical, Mozart, and the song selected for this time: I think that if Mozart were to do something different with his piano concerto no. 20 (k. 466), then I might have actually liked it as much as k. 467, perhaps even better. (For a good YouTube video on k. 466, see here [^].)
But as things stand, it’s k. 467. It is one of the rarest Western (or Eastern) classical pieces that can auto ride on my mind at some unpredictable moments; also one of the rare pieces that never disappoint me when I play it. Maybe that’s because I don’t play it unless I am in the right mood. A mood that’s not at all bright; a mood that suggests as if someone were plaintively raising the question “why? But why?”. (Perhaps even: “But why me?”) It’s a question not asked to any one in particular. It’s a question raised in the midst of continuing to bear either some tragic something, or, maybe, a question raised while in the midst of having to suffer the consequences of someone else’s stupidity or so. … In fact, it’s not even a question explicitly raised. It’s to do with some feeling which comes before you even become aware of it, let alone translate it into a verbal question. I don’t know, the mood is something like that. … I don’t get in that kind of a mood very often. But sometimes, this kind of a mood is impossible to avoid. And then, if the outward expression of such a mood also is this great, sometimes, you even feel like listening to it… The thing here is, any ordinary composer can evoke pathos. But what Mozart does is in an entirely different class. He captures the process of forming that question clearly, alright. But he captures the whole process in such a subdued manner. Extraordinary clarity, and extraordinary subdued way of expressing it. That’s what appeals to me in this piece… How do I put it?… It’s the epistemological clarity or something like that; I don’t know. Whatever be the reason, I simply love this piece. Even if I play it only infrequently.
Coming back to the dynamic k. 466 vs. the quiet, sombre, even plaintive k. 467, I think, the makers of the “Elvira Madigan” movie were smart; they correctly picked up the k. 467, and only the second movement, not others. It’s the second movement that’s musically extraordinary. My opinion, anyway…
Bye for now.