Non-Interview Questions on Data Science—Part 1

This entry is the first in a series of posts which will note some of the questions that no one will ever ask you during any interview for any position in the Data Science industry.

Naturally, if you ask for my opinion, you should not even consider modifying these questions a bit and posting them as part of your own post on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.

No, really! There would be no point in lifting these questions and posting them as if they were yours, because no one in the industry is ever going to get impressed by you because you raised them. … I am posting them here simply because… because “I am like that only.”

OK, so here is the first installment in this practically useless series. (I should know. I go jobless.)

(Part 1 mostly covers linear and logistic regression, and just a bit of probability.)

Q.1: Consider probability theory. How are the following ideas related to each other: random phenomenon, random experiment, trial, result, outcome, outcome space, sample space, event, random variable, and probability distribution? In particular, state precisely the difference between a result and an outcome, and between an outcome and an event.

Give a few examples of finite and countably infinite sample spaces. Give one example of a random variable whose domain is not the real number line. (Hint: See the Advice at the end of this post concerning which books to consult.)

Q.2: In set theory, when a set is defined through enumeration, repeated instances are not included in the definition. In light of this fact, answer the following question: Is an event a set? Or is it just a primitive instance subsumed in a set? What precisely is the difference between a trial, a result of a trial, and an event? (Hint: See the Advice at the end of this post concerning which books to consult.)

Q.3: Select the best alternative: In regression for making predictions with continuous target data, if a model is constructed in reference to the equation $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3$, then:
(a) It is a sub-type of the linear regression model.
(b) It is a polynomial regression model.
(c) It is a nonlinear regression model because powers $> 1$ of the independent variable $x_i$ are involved.
(d) It is a nonlinear regression model because more than two $\beta_m$ terms are involved.
(e) Both (a) and (b)
(f) Both (b) and (c)
(g) Both (c) and (d)
(h) All of (b), (c), and (d)
(i) None of the above.
(Hint: Don’t rely too much on the textbooks being used by the BE (CS) students in the leading engineering colleges in Pune and Mumbai.)
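As a hedged hint toward Q.3 (a sketch with made-up numbers, not an answer key): the cubic model is still linear in the parameters $\beta_m$, so ordinary linear least squares fits it directly once you build the right design matrix.

```python
import numpy as np

# A cubic in x is still *linear in the parameters* beta, so plain
# linear least squares fits it exactly on noiseless toy data.
x = np.linspace(-2, 2, 50)
true_beta = np.array([1.0, -0.5, 2.0, 0.3])
y = true_beta[0] + true_beta[1]*x + true_beta[2]*x**2 + true_beta[3]*x**3

# Design matrix with columns 1, x, x^2, x^3 -- the "trick" that makes
# polynomial regression a special case of linear regression.
X = np.vander(x, N=4, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

Whether that makes the model "linear," "polynomial," or "nonlinear" in the sense the options intend is exactly what the question asks you to decide.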

Q.4: Consider a data-set consisting of students' performance on a class test. It has three columns: student ID, hours studied, and marks obtained. Suppose you decide to use the simple linear regression technique to make predictions.

Let’s say that you assume that the hours studied are the independent variable (predictor), and the marks obtained are the dependent variable (response). Making this assumption, you make a scatter plot, carry out the regression, and plot the regression line predicted by the model too.

The question now is: If you interchange the designations of the dependent and independent variables (i.e., if you take the marks obtained as predictors and the hours studied as responses), build a second linear model on this basis, and plot the regression line thus predicted, will it coincide with the line plotted earlier or not? Why or why not?

Repeat the question for the polynomial regression. Repeat the question if you include the simplest interaction term in the linear model.
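For the simple-linear-regression part of Q.4, here is a quick numerical check you can run (my own made-up data, not the post's answer): the product of the two fitted slopes equals $r^2$, so the question reduces to asking when $r^2 = 1$.

```python
import numpy as np

# Toy "hours studied" vs "marks obtained" data with noise.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 40)
marks = 5.0 * hours + rng.normal(0.0, 8.0, 40)

cov = np.cov(hours, marks)[0, 1]
b_y_on_x = cov / np.var(hours, ddof=1)   # slope when marks = f(hours)
b_x_on_y = cov / np.var(marks, ddof=1)   # slope when hours = g(marks)

r2 = np.corrcoef(hours, marks)[0, 1] ** 2
# Product of the two slopes equals r^2; the two lines coincide
# only in the degenerate case r^2 = 1.
print(b_y_on_x * b_x_on_y, r2)
```

The geometry behind this (which residuals each fit minimizes, vertical vs. horizontal) is the "why" the question is after.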

Q.5: Draw a schematic diagram showing circles for nodes and straight lines for connections (as in the ANN diagrams) for a binary logistic regression machine that operates on just one feature. Wonder why your text-book didn't draw it in the chapter on logistic regression.

Q.6: Suppose that the training input for a classification task consists of $r$ distinct data-points and $c$ features. If logistic regression is to be used for classification of this data, state how many unknown parameters there would be. Make suitable assumptions as necessary, and state them.
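One possible counting for Q.6, under assumptions of my own that I state in the comments (the helper function below is hypothetical, not from the post):

```python
# Hypothetical helper for Q.6 -- counts unknown parameters, assuming
# (i) a bias/intercept term is included, and (ii) for K > 2 classes a
# softmax head is used with one class kept as the reference.
# Note the count does not depend on r, the number of data rows.
def n_params(c, n_classes=2):
    return (c + 1) * (n_classes - 1)

print(n_params(c=5))               # binary case: 5 weights + 1 bias = 6
print(n_params(c=5, n_classes=4))  # 4-way softmax: (5 + 1) * 3 = 18
```

Different conventions (no intercept, or a full non-reference softmax parameterization) give different counts, which is why the question asks you to state your assumptions.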

Q.7: Obtain (or write) some simple Python code implementing, from scratch, a single-feature binary logistic regression machine that uses the simple (non-stochastic) gradient descent method, computing the gradient for one row at a time (a batch size of 1).

Modify the code to show a real-time animation of how the model goes on changing as the gradient descent algorithm progresses. The animation should depict a scatter plot of the sample data ($y$ vs. $x$) and not the parameter space ($\beta_0$ vs. $\beta_1$). The animation should highlight the data-point currently being processed in a separate color. It should also show a plot of the logistic function on the same graph.

Can you imagine, right before running (or even building) the animation, what kind of visual changes it is going to depict? How?
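Not an answer key, but a minimal sketch of the kind of machine Q.7 asks for, on toy data of my own choosing (rows processed in order, one at a time; the animation part is left to you):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(x, y, lr=0.1, epochs=500):
    # Per-row updates (batch size of 1), rows taken in their given
    # order with no shuffling -- my reading of the post's "simple,
    # non-stochastic" variant.
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            # Gradient of the log-loss for a single example is (p - y).
            b0 -= lr * (p - yi)
            b1 -= lr * (p - yi) * xi
    return b0, b1

# Toy data: the class flips from 0 to 1 around x = 3.
x = np.array([0., 1., 2., 4., 5., 6.])
y = np.array([0., 0., 0., 1., 1., 1.])
b0, b1 = fit(x, y)
preds = (sigmoid(b0 + b1 * x) >= 0.5).astype(float)
print(preds)
```

Watching `b0` and `b1` after each row-update is exactly what the requested animation would visualize: the sigmoid curve steepening and sliding toward the class boundary.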

Q.8: What are the important advantages of the stochastic gradient descent method over the simple (non-stochastic) gradient descent?

Q.9: State true or false: (i) The output of the logistic function is continuous. (ii) The minimization of the cost function in logistic regression involves a continuous dependence on the undetermined parameters.

In light of your answers, explain why logistic regression can be used at all as a classification mechanism (i.e., for targets that are "discrete", not continuous). State only those axioms of probability theory which are directly relevant here.

Q.10: Draw diagrams in the parameter space for the Lasso regression and the Ridge regression. Explain precisely what lies inside the square or the circular region. In each case, draw an example path that might get traced during the gradient descent, and clearly explain why the progress occurs the way it does.
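As a numerical companion to Q.10's diagrams (my own toy data; a proximal step for Lasso and the closed form for Ridge; a sketch, not a production solver), here is the behavioral difference the square and the circle produce: the L1 penalty drives a useless coefficient to exactly zero, while the L2 penalty only shrinks it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
# Feature 2 is irrelevant: its true coefficient is 0.
y = X @ np.array([3.0, 0.0]) + rng.normal(0.0, 0.1, 100)

def fit_lasso(X, y, lam, lr=0.01, steps=2000):
    # Proximal gradient descent: a gradient step on the least-squares
    # part, then soft-thresholding -- which is what creates exact zeros.
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ b - y) / len(y)
        b = b - lr * grad
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)
    return b

def fit_ridge(X, y, lam):
    # Closed form: (X'X/n + lam*I)^(-1) X'y/n -- shrinks, never zeros.
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

b_l1 = fit_lasso(X, y, lam=0.5)
b_l2 = fit_ridge(X, y, lam=0.5)
print(b_l1, b_l2)
```

In diagram terms: the loss contours touch the L1 square preferentially at its corners (where coordinates are exactly zero), while the L2 circle has no corners to favor.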

Q.11: Briefly explain how the idea of logistic regression gets applied in artificial neural networks (ANNs). Suppose that a training data-set has $c$ features, $r$ data-rows, and $M$ output bins (i.e., classification types). Assuming that the neural network does not carry any hidden layers, calculate the number of logistic regressions that would be performed in a single batch. Make suitable assumptions as necessary.

Q.12: State the most prominent limitation of the gradient descent methods. State the name of any one technique which can overcome this limitation.

Advice: To answer the first two questions, don't refer to the programming books. In fact, don't even rely too much on the usual textbooks. Even Wasserman skips over the topic, and Stirzaker is inadequate. Kreyszig is barely OK. A recommended text (more rigorous but UG-level, and brief) for this topic is: “An Introduction to Probability and Statistics” (2015) by Rohatgi and Saleh, Wiley.

Awww… Still with me?

If you read this far, chances are very bright that you are really, really desperately looking for a job in the data science field. And, as it so happens, I am also a very, very kind-hearted person. I don’t like to disappoint nice, ambitious… err… “aspiring” people. So, let me offer you some real help before you decide to close this page (and this blog) forever.

Here is one question they might actually ask you during an interview—especially if the interviewer is an MBA:

What are the three V’s of big data? Four? Five?

(Yes, MBAs do know arithmetic. At least, it was there on their CAT / GMAT entrance exams. Yes, you can use this question for your posts on Medium.com, AnalyticsVidhya, KDNuggets, TowardsDataScience, ComingFromDataScience, etc.)

A couple of notes:

1. I might come back and revise the questions to make them less ambiguous or more precise.
2. Also, please do drop a line if any of the questions is not valid, or shows a poor understanding on my part—this is easily possible.

A song I like:

[Credits listed in a random order. Good!]

(Hindi) “mausam kee sargam ko sun…”
Music: Jatin-Lalit
Singer: Kavita Krishnamoorthy
Lyrics: Majrooh Sultanpuri

History:

First written: Friday 14 June 2019 11:50:25 AM IST.
Published online: 2019.06.16 12:45 IST.
The songs section added: 2019.06.16 22:18 IST.

The seven books challenge—my list

“Accepted challenge to post covers of 7 books I love: no explanations, no reviews – just the cover”

You might have run into tweets of the above kind in the recent past. Here, I would like to accept that challenge. [Unlike those tweets, there is no “from” clause in the above sentence because no one actually challenged me to it! I just noticed this challenge in Ash Joglekar’s twitter feed, and decided to pick it up on my own!]

A few notes:

No reviews or explanations regarding the choices of books, but still, a few notes are due—e.g., why I supply only a list and not the snaps of the covers.

1. Many of my books still remain packed up in the movers-and-packers’ boxes. These boxes are kept tightly sticking to each other and right in front of the wall-cupboard that is full of even more books (stacked up several layers deep). Since there is no place elsewhere in the house, the boxes stay there—they cannot be opened because if they are, I don’t have the space to keep those books at some other place. Further, since the boxes are heavy, I cannot easily move them aside and reach into the cupboard either. In short, these days, most of my books happen to be physically inaccessible to me. (The apartment where we currently live is too small for us.) Unless there is a strong reason for reference, the books don’t get out; they just stay where they are.

Further, I don’t have paper copies for all the books that struck me when I took up this challenge, because a couple of them I only read in the university library (i.e. the Hill library of UAB), or later on, as PDF documents (not paper copies).

For all such reasons, instead of posting the covers, here, I will supply only the titles.

2. There were other books that had struck me even more preferentially, but I decided not to include them in this list here because they were in Marathi. Drop me a line if you wish to know which ones those were.

3. All in all, I spent under 2 minutes (possibly under 1 minute) in getting to the following list. However, later on, I decided to re-arrange it in the chronological order in which I first ran into these books. The year of my first acquaintance with each book is given in the square brackets.

The list:

• Introduction to Objectivist Epistemology, 1e, by Ayn Rand [1981]
• Physics (the old paperback Indian ed. with yellow-and-black cover, in 2 volumes), by Resnick and Halliday [1984]
• In Search of Schrödinger’s Cat, 1e, by John Gribbin (i.e., the cat book, not the kitten one) [1987 or 1988]
• Mathematical Thought from Ancient to Modern Times (3 volumes), by Morris Kline [1992]
• Twenty Cases Suggestive of Reincarnation, by Ian Stevenson [1993]
• Computational Physics: Problem Solving with Python, by Landau, Paez and Bordeianu [2010 or so]
• Quantum Chemistry, by Donald McQuarrie [2011]

Afterthoughts:

• Since the initial posting, there is a change in one of the books. Now I list the 20 cases book by Ian Stevenson instead of his 4 volumes, because I now remember that the former was what I had completely read through; the latter I had only browsed through. … Hey, others get an entire day per book, OK?
• On second thoughts, I wanted to have Quantum Chemistry by Donald McQuarrie [17 February 2011] in there. … So I have removed a CS book which used to appear on the list (viz., Structured Computer Organization, by Andrew Tanenbaum [1995]). In fact, since McQuarrie’s book is easily accessible to me right now, I am right away posting its cover here; see below.
• … Guess I will have to post a second list some time later on! … I mean to say, there is no book of solid or fluid mechanics in there, none on CFD or FEM… And, none on so many other topics / other authors…

I guess the songs section is not really necessary for this post. So I will drop it for this time round.

Why are NYRs so hard to keep?

Why do people at all make all those New Year Resolutions (NYRs)? Any idea? And once having made them, why do they end up breaking them all so soon? Why do the NYRs turn out to be so hard to keep?

You have tried making some resolutions at least a few times in the past, haven’t you? So just think a bit about it before reading further—think why they were so hard to keep. … Was it all an issue of a lack of sufficient will-power? Or was something else at work here? Think about it…

My answer appears immediately below, so if you want to think a little about it, then really, stop reading right here, and come back and continue once you are ready to go further.

People make resolutions because they want to get better, and they also decide on doing something about it, like setting concrete goal-posts for it.

Further, I think that people fail to keep the resolutions because they make them only at the 11th hour.

A frequently offered counter-argument:

Now, you might object to the first part of my answer. “Who takes all that self-improvement nonsense so seriously anyway?” you might argue. “People make resolutions simply because it’s a very common thing to do on the New Year’s Eve. Everyone else is happy making them, and so, you are led into believing that may be, you too should have a shot at it. But really speaking, the whole thing is just a joke.”

Good attempt at finding the reasons! But not exactly a very acute analysis. Let me show you how, by tackling just this one aspect: making resolutions just because the other people are doing the same…

Following other people—what does that exactly mean?:

If someone goes on to repeat a certain thing just because everyone else is doing it, does this fact by itself make him a part of the herd? A fool? Really? Think about it.

Suppose you have been watching an absolutely thrilling sports match, say a one-day international cricket match. Suppose you have specially arranged for a day’s leave from your work, and you have gone with your friends to the stadium. Suppose that the team you have been rooting for wins the finals. Everyone in your group suddenly begins dancing, yells, blows horns, beats drums, and all that. Your group generally begins to have a wild celebration together. Seeing them do that, almost within a fraction of a second, you join them, too.

Something similar holds for the NYRs too. People make resolutions because there is some underlying cause, a personal reason, for why they want to do that. And the reason is what I already said above: namely, that they want to get better.

Of course, it’s not that you didn’t have any point in your argument above. The influence of the other people sure is always there. But it’s a minor, incidental, thing, occurring purely at the surface.

How people actually make their resolutions:

Coming back to the NYRs, it’s a fact that around the year-end, a great number of people are busy with certain seasonal things: compiling all those top-10 lists (for the bygone year), buying or gifting diaries or calendars (for the new year), and, of course, making resolutions for the new year. Often, they “seriously” let you in on the resolutions they have decided on, too.

If so many people were not to get so enthusiastic about making these NYRs, it’s possible, nay, even probable, that you yourself wouldn’t have thought of doing the same thing on this occasion. Possible. So, in that sense, yes, you are getting influenced by what other people do.

Yet, when it is time to take the actual action, people invariably try to figure out what is personally important to them. Not to someone else. In making resolutions, people actually don’t think too much about society, come to think of it.

No one resolves something like, for instance, that he will take a 10,000 km one-way trip in the new year, and go help some completely random couple settle some issue between them: like, you know, why he spends so much money on gadgets, or why she spends so much time on getting ready, or how they should settle their divorce agreement. People typically aren’t very enthusiastic about keeping such aims by way of New Year’s Resolutions, especially if they involve complete strangers. Even if it is true that a lot of people do resolve to undertake some humanitarian service, it’s more out of a feeling of having to combine something that is good with something that is social, or altruistic. The first element (the desire to do something good, to bring about some “real change”) is the more dominant motivation there, most often. And even if it is true that there are just six degrees of separation between most of humanity, the fact of the matter still remains that while settling on their resolutions, most people usually don’t traverse even one degree, let alone the remaining five (i.e., the entire society).

On the other hand, quitting drinking (or at least resolving to limit oneself to “just a couple of pegs, that’s all”) is different. This one particular resolution appears very regularly near the top of people’s lists. There often seems to be an underlying sense that there is an area where they need to improve themselves. An awareness of that vague sense is then followed by a resolution, a “commitment, come what may,” sort of. To give it a good try all over once again, so to speak.

And yet, despite this matter being of such personal importance, people still often fail in keeping their resolutions. Think of the usual resolutions like “regular exercise,” or “not having any more than a 90 [ml of a hard drink] on an evening,” or “maintaining all expenses on a daily basis, and balancing bank-books regularly…” These are some of the items that regularly appear on people’s lists. That’s the good part. The bad part is, the same items happen to appear on the lists of the same people year after year.

Now, coming to the reasons for such a mass-ive (I mean wide-spread) failure, I have already given you a hint above. People typically fail, I said, because they make those resolutions at the 11th hour. They make them on the spur of the moment, often thinking them up right on the night of the 31st itself.

OK, let me note an aside here. The issue, I think, is not really one of just time. Hey, what are those new year’s diaries and planners for, except for using them at the beginning of the year? And people do use such aids for some time at the beginning. … So, yes, time-tables and all are involved, and people still fail to keep up.

So, the issue must be deeper than that, I thought. In any case, I have come to form one hypothesis about it.

Come to think of it, some time ago, I had jotted down my thoughts on this matter in a somewhat lighter vein. I had said: if you want to keep your resolutions, make only those which you can actually keep!

Coming back to the hypothesis which I now have: well, it is somewhat along similar lines, but in a bit more detailed, more “advanced” sort of way. I am going to test it on myself first at the turn of this year, and see how good or poor it turns out to be (for whatever this idea is worth as a hypothesis anyway).

As a part of my testing “strategy” I will also be announcing my NYRs on the 31st (or at the most the 1st) here. Stay tuned.

Oh yes, by way of a minor update: though I was down for a few days with a minor fever and nausea, I have by now recovered well, and am already back pursuing data science. … More, later.

… Oh yes, the crackers remind me. … Happy Christmas, once again…

Will be back on the 31st or 1st. Until then, take care, and bye for now…

A song I like:
(Hindi) “Yun hi chala chal rahi”
Singers: Kailash Kher, Hariharan, Udit Narayan
Music: A. R. Rahman
Lyrics: Javed Akhtar

[Guess no need to edit this post; it’s mostly come out as pretty OK right in the first pass; will leave it as is.]