Non-Interview Questions on Data Science—Part 1

This entry is the first in a series of posts which will note some of the questions that no one will ever ask you during any interview for any position in the Data Science industry.

Naturally, if you ask for my opinion, you should not consider modifying these questions a bit and posting them as a part of your own post on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.

No, really! There would be no point in lifting these questions and posting them as if they were yours, because no one in the industry is ever going to get impressed by you because you raised them. … I am posting them here simply because… because “I am like that only.”

OK, so here is the first installment in this practically useless series. (I should know. I go jobless.)

(Part 1 mostly covers linear and logistic regression, and just a bit of probability.)


Q.1: Consider the probability theory. How are the following ideas related to each other?: random phenomenon, random experiment, trial, result, outcome, outcome space, sample space, event, random variable, and probability distribution. In particular, state precisely the difference between a result and an outcome, and between an outcome and an event.

Give a few examples of finite and countably infinite sample spaces. Give one example of a random variable whose domain is not the real number line. (Hint: See the Advice at the end of this post concerning which books to consult.)


Q.2: In the set theory, when a set is defined through enumeration, repeated instances are not included in the definition. In the light of this fact, answer the following question: Is an event a set? Or is it just a primitive instance subsumed in a set? What precisely is the difference between a trial, a result of a trial, and an event? (Hint: See the Advice at the end of this post concerning which books to consult.)


Q.3: Select the best alternative: In regression for making predictions with a continuous target data, if a model is constructed in reference to the equation y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3, then:
(a) It is a sub-type of the linear regression model.
(b) It is a polynomial regression model.
(c) It is a nonlinear regression model because powers > 1 of the independent variable x_i are involved.
(d) It is a nonlinear regression model because more than two \beta_m terms are involved.
(e) Both (a) and (b)
(f) Both (b) and (c)
(g) Both (c) and (d)
(h) All of (b), (c), and (d)
(i) None of the above.
(Hint: Don’t rely too much on the textbooks being used by the BE (CS) students in the leading engineering colleges in Pune and Mumbai.)


Q.4: Consider a data-set consisting of performance of students on a class test. It has three columns: student ID, hours studied, and marks obtained. Suppose you decide to use the simple linear regression technique to make predictions.

Let’s say that you assume that the hours studied are the independent variable (predictor), and the marks obtained are the dependent variable (response). Making this assumption, you make a scatter plot, carry out the regression, and plot the regression line predicted by the model too.

The question now is: If you interchange the designations of the dependent and independent variables (i.e., if you take the marks obtained as predictors and the hours studied as responses), build a second linear model on this basis, and plot the regression line thus predicted, will it coincide with the line plotted earlier or not? Why or why not?

Repeat the question for the polynomial regression. Repeat the question if you include the simplest interaction term in the linear model.
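If you would rather experiment than reason it out on paper, here is a minimal sketch for the simple linear case. The data values and the helper name fit_line are made up for illustration; only the ordinary least-squares formulas are standard.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical class-test data (not from any real data-set).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
marks = [35.0, 48.0, 50.0, 66.0, 64.0, 80.0]

a0, a1 = fit_line(hours, marks)   # first model:  marks = a0 + a1 * hours
c0, c1 = fit_line(marks, hours)   # second model: hours = c0 + c1 * marks

# Re-express the second line in the same (hours, marks) plane as the first:
# hours = c0 + c1 * marks  <=>  marks = -c0 / c1 + (1 / c1) * hours
slope_swapped = 1.0 / c1
intercept_swapped = -c0 / c1

print("marks-on-hours line:", a0, a1)
print("hours-on-marks line, replotted:", intercept_swapped, slope_swapped)
print("product of the two fitted slopes:", a1 * c1)
```

Plot both lines over the same scatter plot and see where they meet. The product a1 * c1 equals the squared correlation coefficient of the data, which should give you a hint about when the two lines can coincide.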


Q.5: Draw a schematic diagram showing circles for nodes and straight-lines for connections (as in the ANN diagrams) for a binary logistic regression machine that operates on just one feature. Wonder why your text-book didn’t draw it in the chapter on the logistic regression.


Q.6: Suppose that the training input for a classification task consists of r number of distinct data-points and c number of features. If logistic regression is to be used for classification of this data, state the number of the unknown parameters there would be. Make suitable assumptions as necessary, and state them.


Q.7: Obtain (or write) some simple Python code for implementing from scratch a single-feature binary logistic regression machine that uses the simple (non-stochastic) gradient descent method that computes the gradient for each row (batch-size of 1).

Modify the code to show a real-time animation of how the model goes on changing as the gradient descent algorithm progresses. The animation should depict a scatter plot of the sample data (y vs. x) and not the parameters space (\beta_0 vs. \beta_1). The animation should highlight the data-point currently being processed in a separate color. It should also show a plot of the logistic function on the same graph.

Can you imagine, right before running (or even building) the animation, what kind of visual changes is the animation going to depict? how?
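A possible starting point is sketched below. It is only one reading of the question: I take "non-stochastic" to mean that the rows are visited in a fixed order, without shuffling, and that the parameters are updated after every row. The data, the learning rate, the epoch count and the function names are all assumptions. The history list is there so that an animation could replay one frame per processed row.

```python
import math

def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(xs, ys, lr=0.1, epochs=200):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x), updating after every row."""
    b0, b1 = 0.0, 0.0
    history = []  # one (b0, b1) snapshot per processed row
    for _ in range(epochs):
        for x, y in zip(xs, ys):      # fixed row order, batch size of 1
            p = sigmoid(b0 + b1 * x)
            err = p - y               # derivative of the log-loss w.r.t. the logit
            b0 -= lr * err
            b1 -= lr * err * x
            history.append((b0, b1))
    return b0, b1, history

# Hypothetical single-feature data with one overlapping pair near x = 2.25.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1, history = train_logistic(xs, ys)
predictions = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in xs]
print("b0 =", b0, "b1 =", b1)
print("predictions:", predictions)
```

For the animation, each entry of history gives the logistic curve to draw for that frame, and the index of the entry modulo len(xs) tells you which data-point to highlight.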


Q.8: What are the important advantages of the stochastic gradient descent method over the simple (non-stochastic) gradient descent?


Q.9: State true or false: (i) The output of the logistic function is continuous. (ii) The minimization of the cost function in logistic regression involves a continuous dependence on the undetermined parameters.

In the light of your answers, explain the reason why the logistic regression can at all be used as a classification mechanism (i.e. for targets that are “discrete”, not continuous). State only those axioms of the probability theory which are directly relevant here.


Q.10: Draw diagrams in the parameters-space for the Lasso regression and the Ridge regression. The question now is to explain precisely what lies inside the square or circular region. In each case, draw an example path that might get traced during the gradient descent, and clearly explain why the progress occurs the way it does.


Q.11: Briefly explain how the idea of the logistic regression gets applied in the artificial neural networks (ANNs). Suppose that a training data-set has c number of features, r number of data-rows, and M number of output bins (i.e. classification types). Assuming that the neural network does not carry any hidden layers, calculate the number of logistic regressions that would be performed in a single batch. Make suitable assumptions as necessary.

Does your answer change if you consider the multinomial logistic regression?


Q.12: State the most prominent limitation of the gradient descent methods. State the name of any one technique which can overcome this limitation.


Advice: To answer the first two questions, don’t refer to the programming books. In fact, don’t even rely too much on the usual textbooks. Even Wasserman skips over the topic and Stirzaker is inadequate. Kreyszig is barely OK. A recommended text (more rigorous but UG-level, and brief) for this topic is: “An Introduction to Probability and Statistics” (2015) Rohatgi and Saleh, Wiley.


Awww… Still with me?

If you read this far, chances are very bright that you are really^{really} desperately looking for a job in the data science field. And, as it so happens, I am also a very, very kind hearted person. I don’t like to disappoint nice, ambitious… err… “aspiring” people. So, let me offer you some real help before you decide to close this page (and this blog) forever.

Here is one question they might actually ask you during an interview—especially if the interviewer is an MBA:

A question they might actually ask you in an interview: What are the three V’s of big data? four? five?

(Yes, MBA’s do know arithmetic. At least, it was there on their CAT / GMAT entrance exams. Yes, you can use this question for your posts on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.)


A couple of notes:

  1. I might come back and revise the questions to make them less ambiguous or more precise.
  2. Also, please do drop a line if any of the questions is not valid, or shows a poor understanding on my part—this is easily possible.

 


A song I like:

[Credits listed in a random order. Good!]

(Hindi) “mausam kee sargam ko sun…”
Music: Jatin-Lalit
Singer: Kavita Krishnamoorthy
Lyrics: Majrooh Sultanpuri


History:

First written: Friday 14 June 2019 11:50:25 AM IST.
Published online: 2019.06.16 12:45 IST.
The songs section added: 2019.06.16 22:18 IST.


The seven books challenge—my list

“Accepted challenge to post covers of 7 books I love: no explanations, no reviews – just the cover”

You might have run into tweets of the above kind in the recent past. Here, I would like to accept that challenge. [Unlike those tweets, there is no “from” clause in the above sentence because no one actually challenged me to it! I just noticed this challenge in Ash Joglekar’s twitter feed, and decided to pick it up on my own!]


A few notes:

No reviews or explanations regarding the choices of books, but still, a few notes are due—e.g., why I supply only a list and not the snaps of the covers.

1. Many of my books still remain packed up in the movers-and-packers’ boxes. These boxes are kept tightly sticking to each other and right in front of the wall-cupboard that is full of even more books (stacked up several layers deep). Since there is no place elsewhere in the house, the boxes stay there—they cannot be opened because if they are, I don’t have the space to keep those books at some other place. Further, since the boxes are heavy, I cannot easily move them aside and reach into the cupboard either. In short, these days, most of my books happen to be physically inaccessible to me. (The apartment where we currently live is too small for us.) Unless there is a strong reason for reference, the books don’t get out; they just stay where they are.

Further, I don’t have paper copies for all the books that struck me when I took up this challenge, because a couple of them I only read in the university library (i.e. the Hill library of UAB), or later on, as PDF documents (not paper copies).

For all such reasons, instead of posting the covers, here, I will supply only the titles.

2. There were other books that had struck me even more preferentially, but I decided not to include them in this list here because they were in Marathi. Drop me a line if you wish to know which ones those were.

3. All in all, I spent roughly less than 2 minutes (possibly less than 1 minute) in getting to the following list. However, later on, I decided to re-arrange it in the chronological order in which I first ran into these books. The year of my first acquaintance with the book is given in the square brackets.


The list:

  • Introduction to Objectivist Epistemology, 1e, by Ayn Rand [1981]
  • Physics (the old paperback Indian ed. with yellow-and-black cover, in 2 volumes), by Resnick and Halliday [1984]
  • In Search of Schrödinger’s Cat, 1e, by John Gribbin (i.e., the cat book, not the kitten one) [1987 or 1988]
  • Mathematical Thought from Ancient to Modern Times (3 volumes), by Morris Kline [1992]
  • Twenty Cases Suggestive of Reincarnation, by Ian Stevenson [1993]
  • Computational Physics: Problem Solving with Python, by Landau, Paez and Bordeianu [2010 or so]
  • Quantum Chemistry, by Donald McQuarrie [2011]

Afterthoughts:

  • Since the initial posting, there is a change in one of the books. Now I list the 20 cases book by Ian Stevenson instead of his 4 volumes, because I now remember that the former was what I had completely read through; the latter I had only browsed through. … Hey, others get an entire day per book, OK?
  • On second thoughts, I wanted to have Quantum Chemistry by Donald McQuarrie [17 February 2011] in there. … So I have removed a CS book which used to appear on the list (viz., Structured Computer Organization, by Andrew Tanenbaum [1995]). In fact, since McQuarrie’s book is easily accessible to me right now, I am right away posting its cover here; see below.
  • … Guess I will have to post a second list some time later on! … I mean to say, there is no book of solid or fluid mechanics in there, none on CFD or FEM… And, none on so many other topics / other authors…

 

I guess the songs section is not really necessary for this post. So I will drop it for this time round.

 

Why are NYRs so hard to keep?

Why do people at all make all those New Year Resolutions (NYRs)? Any idea? And once having made them, why do they end up breaking them all so soon? Why do the NYRs turn out to be so hard to keep?

You have tried making some resolutions at least a few times in the past, haven’t you? So just think a bit about it before continuing reading further—think why they were so hard to keep. … Was it all an issue of a lack of sufficient will power? Or was something else at work here? Think about it…

My answer appears immediately below, so if you want to think a little about it, then really, stop reading right here, and come back and continue once you are ready to go further.


My answer:

People make resolutions because they want to get better, and also decide on doing something about it, like, setting some concrete goal-posts for it.

Further, I think that people fail to keep the resolutions because they make them only at the 11th hour.


A frequently offered counter-argument:

Now, you might object to the first part of my answer. “Who takes all that self-improvement nonsense so seriously anyway?” you might argue. “People make resolutions simply because it’s a very common thing to do on the New Year’s Eve. Everyone else is happy making them, and so, you are led into believing that may be, you too should have a shot at it. But really speaking, the whole thing is just a joke.”

Good attempt at finding the reasons! But not exactly a very acute analysis. Let me show you how, by tackling just this one aspect: making resolutions just because the other people are doing the same…


Following other people—what does that exactly mean?:

If someone goes on to repeat a certain thing just as everyone else is doing it, then, does this fact by itself make him a part of the herd? a fool? Really? Think about it.

Suppose you have been watching an absolutely thrilling sports match, say a one-day international cricket match. Suppose you have specially arranged for a day’s leave from your work, and you have gone with your friends to the stadium. Suppose that the team you have been rooting for wins the finals. Everyone in your group suddenly begins dancing, yells, blows horns, beats drums, and all that. Your group generally begins to have a wild celebration together. Seeing them do that, almost like within a fraction of a second, you join them, too.

Does your action mean you have been a mindless sheep following the others in your group? Does it mean that you derived no personal pleasure from the win of your team? That you yourself had no desire to express your joy, your exhilaration? Is your excitement predominantly dependent, on such an occasion, on what other people are doing? Or is it the case that the excitement and the joy is all authentically your own, but it’s just that its outer expression differs. For instance, you wouldn’t be able to go *so* wild if your boss were to be sitting in the next row, rooting for the other team! May be it’s just your outer expression which is shaped by looking at how other people celebrate at the occasion. The most you actually gather by observing others is how to express your joy—not that you have joy. (Observe how the Mexican wave works.) It’s not an instance of the herd behaviour at all!

Something similar for the NYRs too. People make resolutions because there is some underlying cause, a personal reason, as to why they want to do that. And the reason is what I already said above. Namely, that they want to get better.

Of course, it’s not that you didn’t have any point in your argument above. The influence of the other people sure is always there. But it’s a minor, incidental, thing, occurring purely at the surface.


How people actually make their resolutions:

Coming back to the NYRs, it’s a fact that around the time of the year-end, there are a great number of other people who are so busy with certain things at this time of the year: compiling all those top 10 lists (for the bygone year), buying or gifting diaries or calendars (for the new year), and, of course, making resolutions for the new year. Often, they “seriously” let you in on what resolutions they have decided, too.

If so many people were not to get so enthusiastic about making these NYRs, it’s possible, nay, even probable, that you yourself wouldn’t have thought of doing the same thing on this occasion. Possible. So, in that sense, yes, you are getting influenced by what other people do.

Yet, when it is time to take the actual action, people invariably try to figure out what is personally important to them. Not to someone else. In making resolutions, people actually don’t think too much about society, come to think of it.

No one resolves something like, for instance, that he will take a 10,000 km one-way trip in the new year, and go help some completely random couple settle some issue between them like, you know, why he spends so much money on the gadgets, or why she spends so much time on getting ready, or how they should settle their divorce agreement. People typically aren’t very enthusiastic about keeping such aims by way of New Year’s Resolutions, especially if they involve complete strangers. Even if it is true that a lot of people do resolve to undertake some humanitarian service, it’s more out of a feeling of having to combine something that is good, and something that is social, or altruistic. The first element (the desire for something good, to bring about some “real change”) is the more dominant motivation there, most often. And even if it is true that there are just six degrees of separation between most of the humanity, the fact of the matter still remains that while settling down on their resolution, most people usually don’t traverse even just one degree, let alone the remaining 5 (i.e. the entire society).

On the other hand, quitting drinking—or at least resolving to limit themselves to “just a couple of pegs, that’s all” is different. This one particular resolution appears very regularly near the top of people’s lists. There often seems to be this underlying sense that there is an area where they need to improve themselves. An awareness of that vague sense is then followed by a resolution, a “commitment, come what may,” sort of. To give it a good try all over once again, so to speak.


The paradox, and a bit about my recent take about it:

And yet, despite this matter being of such a personal importance, people still often fail in keeping their resolutions. Think of the usual resolutions like “regular exercise,” or “not having any more than a 90 [ml of a hard-drink] on an evening,” or “maintaining all expenses on a daily basis, and balancing bank-books regularly…” These are some of the items that regularly appear on people’s lists. That’s the good part. The bad part is, the same items happen to appear on the lists of the same people year after year.

Now, coming to the reasons for such a mass-ive (I mean wide-spread) failure, I have already given you a hint above. People typically fail, I said, because they make those resolutions at the 11th hour. They make them on the spur of the moment, often thinking them up right on the night of the 31st itself.

OK, let me note an aside here. The issue, I think, is not, really speaking, one of just time. Hey, what are those new year’s diaries and planners for, except for using them at the beginning of the year? And people do use such aids for some time period at the beginning. … So, yes, time-tables and all are involved, and people still fail to keep up.

So, the issue must be deeper than that, I thought. In any case, I have come to form one hypothesis about it.

Come to think of it, some time ago, I had jotted down my thoughts on this matter in a somewhat lighter vein. I had said: if you want to keep your resolutions, make only those which you can actually keep!

Coming back to the hypothesis which I now have, well, it is somewhat on similar lines, but in a bit more detailed, more “advanced” sort of a way. I am going to test it on myself first at the turn of this year, and I am going to see how good or poor it turns out to be (for whatever worth this idea is as a hypothesis anyway).

As a part of my testing “strategy” I will also be announcing my NYRs on the 31st (or at the most the 1st) here. Stay tuned.


Oh yes, by way of a minor update, even if I was down for a few days with minor fever and nausea, I have by now well recovered, and already am back pursuing data science. … More, later.

… Oh yes, the crackers remind me. … Happy Christmas, once again…

Will be back on the 31st or 1st. Until then, take care, and bye for now…


A song I like:
(Hindi) “Yun hi chala chal rahi”
Singers: Kailash Kher, Hariharan, Udit Narayan
Music: A. R. Rahman
Lyrics: Javed Akhtar


[Guess no need to edit this post; it’s mostly come out as pretty OK right in the first pass; will leave it as is.]

How many numbers are there in the real number system?

Post updated on 2018/04/05, 19:25 HRS IST:

See the sections added, as well as the corrected and expanded PDF attachment.


As usual, I got a bit distracted from my notes-taking (on numbers, vectors, tensors, CFD, etc.), and so, ended up writing a small “note” on the title question, in a rough-and-ready plain-text file. Today, I converted it into a LaTeX PDF. The current version is here: [^].

(I may change the document contents or its URL without informing in advance. The version “number” is the date and time given in the document itself, just below the title and the author name.)

(However, I won’t disappoint those eminent scholars who are interested in tracing my intellectual development. I will therefore keep the earlier, discarded, versions too, for some time. Here they are (in the later-to-earlier order): [^][^][ ^ ].)


This PDF note may look frivolous, and in some ways it is, but not entirely:

People don’t seem to “get” the fact that any number system other than the real number system would be capable of producing a set consisting of only distinct numbers.

They also don’t easily “get” the fact that the idea of having a distinct succession of numbers is completely different from that of a continuum of them, which is what the real number system is.

The difference is as big as (and similar to) that between (the perceptually grasped) locations vs. (the perceptually grasped) motions. I guess it was Dr. Binswanger who explained these matters in one of his lectures, though he might have called them “points” or “places” instead of “locations”. Here, as I recall, he was explaining from what he had found in good old Aristotle: An object in motion is neither here (at one certain location) nor there (in another certain location), Aristotle said; its state is that it is in motion. The idea of a definite place does not apply to objects in motion. That was the point Dr. Binswanger was explaining.

In short, realize where the error is. The error is in the first two words of the title question: “How many”. The phrase “how many” asks you to identify a number, but an infinity (let alone an infinity of infinity of infinity …) cannot be taken as a number. There lies the contradiction.


BTW, if you are interested, you may check out my take on the concept of space, covered via an entire series of (long) posts, some time ago. See the posts tagged “space”, here [^]


When they (the mathematicians, who else?) tell you that there are as many rational fractions as there are natural numbers, that the two infinities are in some sense “equal”, they do have a valid argument.

But typical of the modern-day mathematicians, they know, but omit to tell you, the complete story.

Since I approach mathematics (or at least the valid foundational issues in maths) from (a valid) epistemology, I can tell you a more complete story, and I will. At least briefly, right here.

Yes, the two infinities are “equal.” Yes, there are as many rational fractions as there are natural numbers. But the densities of the two (over any chosen finite interval) are not.

Take the finite interval [1.0, 101.0). There are 100 distinct natural numbers in it. The size of the finite interval, measured using real numbers, also is 100.0. So the density of the natural numbers over this interval is: 1.0.

But the density of the rational fractions over the same interval is far greater. In fact it is so much greater that no number can at all be used to identify its size: it is infinite. (Go, satisfy yourself that this is so.)
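One way to go and satisfy yourself is to count. The sketch below is only an illustration, and the function name and the choice of cut-offs are mine: it counts the distinct fractions p/q lying in [1, 101) whose denominator does not exceed a given limit, and divides the count by the interval length. No finite run can show an infinity, of course. It can only show that the count keeps growing as the limit on the denominator is relaxed.

```python
from fractions import Fraction

def count_rationals(lo, hi, max_den):
    """Count distinct fractions p/q with lo <= p/q < hi and 1 <= q <= max_den.

    lo and hi are taken to be integers, so the numerators for a given q
    run from lo*q up to hi*q - 1.
    """
    seen = set()
    for q in range(1, max_den + 1):
        for p in range(lo * q, hi * q):
            seen.add(Fraction(p, q))   # Fraction reduces p/q, so duplicates collapse
    return len(seen)

lo, hi = 1, 101
for max_den in (1, 2, 4, 8):
    n = count_rationals(lo, hi, max_den)
    print("max denominator", max_den, "count", n, "density", n / (hi - lo))
```

With the denominator limited to 1 you get back the 100 natural numbers and a density of 1.0. Allowing denominators up to 2, 4 and 8 gives 200, 600 and 2200 distinct fractions, and there is no upper bound on this growth.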

So, your intuition that there is something wrong with Cantor’s argument is valid. (Was it he who began all this business of measuring the “sizes” of infinite sets?)

Both the number of natural numbers and the number of rational fractions are infinities, and these infinities are of the same order, too. But there literally is an infinite difference between their local densities over finite intervals. It is this fact that the “smart” mathematicians didn’t tell you. (Yes, you read it here first.)

At the same time, even if the “density” over the finite interval when the interval is taken “in the gross” (or as a whole) is infinite, there still are an infinite number of sub-intervals that aren’t even touched (let alone exhausted) by the infinity of these rational fractions, all of them falling only within that [1.0, 101.0) interval. Why? Because, notice, we defined the interval in terms of the real numbers, that’s why! That’s the difference between the rational fractions (or any other number-producing system) and the real numbers.


May be I will write another quick post covering some other distractions in the recent times as well, shortly. I will add the songs section at that time, to that (upcoming) post.

Bye for now.

 

Some suggested time-pass (including ideas for Python scripts involving vectors and tensors)

Actually, I am busy writing down some notes on scalars, vectors and tensors, which I will share once they are complete. No, nothing great or very systematic; these are just a few notings here and there taken down mainly for myself. More like a formulae cheat-sheet, but the topic is complicated enough that it was necessary that I have them in one place. Once ready, I will share them. (They may get distributed as extra material on my upcoming FDP (faculty development program) on CFD, too.)

While I remain busy in this activity, and thus stay away from blogging, you can do a few things:


1.

Think about it: You can always build a unique tensor field from any given vector field, say by taking its gradient. (Or, you can build yet another unique tensor field, by taking the Kronecker product of the vector field variable with itself. Or, yet another one by taking the Kronecker product with some other vector field, even just the position field!). And, of course, as you know, you can always build a unique vector field from any scalar field, say by taking its gradient.

So, you can write a Python script to load a B&W image file (or load a color .PNG/.BMP/even .JPEG, and convert it into a gray-scale image). You can then interpret the gray-scale intensities of the individual pixels as the local scalar field values existing at the centers of cells of a structured (squares) mesh, and numerically compute the corresponding gradient vector and tensor fields.

Alternatively, you can also interpret the RGB (or HSL/HSV) values of a color image as the x-, y-, and z-components of a vector field, and then proceed to calculate the corresponding gradient tensor field.

Write the output in XML format.
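To get you started on point 1, here is a rough sketch of the numerical core. Several things in it are assumptions: the image is replaced by a small synthetic gray-scale array (loading a real file would need an imaging library), the mesh is uniform with cell size h, the differences are central in the interior and one-sided at the edges, and the XML element and attribute names are invented, since no schema is fixed above. NumPy's gradient routine would do the same differencing job.

```python
import xml.etree.ElementTree as ET

def gradient_2d(field, h=1.0):
    """Return (gx, gy) for a scalar field stored as field[row][col].

    x runs along columns and y along rows. Central differences are used in
    the interior, one-sided differences at the edges. Needs at least 2 rows
    and 2 columns.
    """
    rows, cols = len(field), len(field[0])
    gx = [[0.0] * cols for _ in range(rows)]
    gy = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            jm, jp = max(j - 1, 0), min(j + 1, cols - 1)
            im, ip = max(i - 1, 0), min(i + 1, rows - 1)
            gx[i][j] = (field[i][jp] - field[i][jm]) / ((jp - jm) * h)
            gy[i][j] = (field[ip][j] - field[im][j]) / ((ip - im) * h)
    return gx, gy

def to_xml(gx, gy):
    """Serialize the gradient vectors cell by cell (hypothetical schema)."""
    root = ET.Element("gradient_field")
    for i, (row_x, row_y) in enumerate(zip(gx, gy)):
        for j, (vx, vy) in enumerate(zip(row_x, row_y)):
            ET.SubElement(root, "cell", row=str(i), col=str(j),
                          gx=repr(vx), gy=repr(vy))
    return ET.tostring(root, encoding="unicode")

# Synthetic stand-in for gray-scale intensities: 2*col + 3*row.
image = [[2.0 * j + 3.0 * i for j in range(4)] for i in range(3)]
gx, gy = gradient_2d(image)
xml_text = to_xml(gx, gy)
print(xml_text[:80])
```

Applying gradient_2d once more to gx and to gy gives the four components of the gradient tensor of the vector field.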


2.

Think about it: You can always build a unique vector field from a given tensor field, say by taking its divergence. Similarly, you can always build a unique scalar field from a vector field, say by taking its divergence.

So, you can write a Python script to load a color image, and interpret the RGB (or HSL/HSV) values now as the xx-, xy-, and yy-components of a symmetrical 2D tensor, and go on to write the code to produce the corresponding vector and scalar fields.
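A similar rough sketch for point 2 follows. Again the inputs are synthetic stand-ins for the three colour channels, the mesh is assumed uniform, and the helper names are mine. It uses the usual component form of the divergence of a symmetric 2D tensor: v_x = d(T_xx)/dx + d(T_xy)/dy and v_y = d(T_xy)/dx + d(T_yy)/dy.

```python
def ddx(f, h=1.0):
    """Derivative along columns (x): central inside, one-sided at the edges."""
    rows, cols = len(f), len(f[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            jm, jp = max(j - 1, 0), min(j + 1, cols - 1)
            out[i][j] = (f[i][jp] - f[i][jm]) / ((jp - jm) * h)
    return out

def ddy(f, h=1.0):
    """Derivative along rows (y): central inside, one-sided at the edges."""
    rows, cols = len(f), len(f[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            im, ip = max(i - 1, 0), min(i + 1, rows - 1)
            out[i][j] = (f[ip][j] - f[im][j]) / ((ip - im) * h)
    return out

def add(a, b):
    """Element-wise sum of two fields of the same shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def div_tensor(txx, txy, tyy, h=1.0):
    """Divergence of a symmetric 2D tensor field: returns the vector (vx, vy)."""
    vx = add(ddx(txx, h), ddy(txy, h))
    vy = add(ddx(txy, h), ddy(tyy, h))
    return vx, vy

def div_vector(vx, vy, h=1.0):
    """Divergence of a 2D vector field: returns a scalar field."""
    return add(ddx(vx, h), ddy(vy, h))

# Synthetic stand-ins for the R, G, B channels, with x = column and y = row:
n = 5
txx = [[float(j * j) for j in range(n)] for i in range(n)]   # T_xx = x^2
txy = [[float(j * i) for j in range(n)] for i in range(n)]   # T_xy = x*y
tyy = [[float(i * i) for j in range(n)] for i in range(n)]   # T_yy = y^2

vx, vy = div_tensor(txx, txy, tyy)
scalar = div_vector(vx, vy)
print("vector at the center cell:", vx[2][2], vy[2][2])
print("scalar at the center cell:", scalar[2][2])
```

For these particular fields the exact answers are v = (3x, 3y) and a scalar of 6, which the central differences reproduce away from the edges.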


Yes, as my resume shows, I was going to write a paper on a simple, interactive, pedagogical, software tool called “ToyDNS” (from Toy + Displacements, Strains, Stresses). I had written an extended abstract, and it had even got accepted in a renowned international conference. However, at that time, I was in an industrial job, and didn’t get the time to write the software or the paper. Even later on, the matter kept slipping.

I now plan to surely take this up on priority, as soon as I am done with (i) the notes currently in progress, and immediately thereafter, (ii) my upcoming stress-definition paper (see my last couple of posts here and the related discussion at iMechanica).

Anyway, the ideas in the points 1. and 2. above were, originally, a part of my planned “ToyDNS” paper.


3.

You can induce a “zen-like” state in you, or if not that, then at least a “TV-watching” state (actually, something better than that), simply by pursuing this URL [^], and pouring in all your valuable hours into it. … Or who knows, you might also turn into a closet meteorologist, just like me. [And don’t tell anyone, but what they show here is actually a vector field.]


4.

You can listen to this song in the next section…. It’s one of those flowy things which have come to us from that great old Grand-Master, viz., SD Burman himself! … Other songs falling in this same sub-sub-genre include, “yeh kisine geet chheDaa,” and “ThanDi hawaaein,” both of which I have run before. So, now, you go enjoy yet another one of the same kind—and quality. …


A Song I Like:

[It’s impossible to figure out whose contribution is greater here: SD’s, Sahir’s, or Lata’s. So, this is one of those happy circumstances in which the order of the listing of the credits is purely incidental … Also recommended is the video of this song. Mona Singh (aka Kalpana Kartik (i.e. Dev Anand’s wife, for the new generation)) is sooooo magical here, simply because she is so… natural here…]

(Hindi) “phailee huyi hai sapanon ki baahen”
Music: S. D. Burman
Lyrics: Sahir
Singer: Lata Mangeshkar


But don’t forget to write those Python scripts….

Take care, and bye for now…