Sound separation, voice separation from a song, the cocktail party effect, etc., and AI/ML

A Special note for the Potential Employers from the Data Science field:

Recently, in April 2020, I achieved a World Rank # 5 on the MNIST problem. The initial announcement can be found here [^], and a further status update, here [^].

All my data science-related posts can always be found here [^].

1. The “Revival” series of “Sa Re Ga Ma” music company:

It all began with the two versions of the song which I ran the last time (in the usual songs section). The original song (or so I suppose) is here [^]. The “Revival” edition of the same song is here [^]. For this particular song, I happened to like the “Revival” version just ever so slightly better. It seemed to “fill” the tonal landscape of this song better, without too much degradation to the vocals, which almost always happens in the “Revival” series. I listened to both these versions back-to-back, a couple of times. That set me thinking.

I thought that, perhaps, by further developing the existing AI techniques and using them together with some kind of advanced features for manual editing, it should be possible to achieve the same goals that the makers of “Revival” were aiming at, but too often fell too short of.

So, I did a bit of Google search, and found discussions like this one at HiFiVision [^], and this one at [^]. Go through the comments; they are lively!

On the AI side, I found this Q&A at Quora: [^]. One of the comments said: “Just Google this for searching `software to clarify and improve sound recordings’ and you will have several thousand listings for software that does this job.”

Aha! So there already were at least a few software packages to do the job?

A few more searches later, I landed at Spleeter by Deezer [^]. (In retrospect, this seems inevitable.)

2. AI software for processing songs: Spleeter:

Spleeter uses Python and TensorFlow, and the installation instructions assume Conda. So, I immediately tried to download it, but saw my Conda getting into trouble “solving” for the “environment”. Tch. Too bad! I stopped the Conda process, closed the terminal window, and recalled the sage advice: “If everything fails, refer to the manual.” … Back to the GitHub page on Deezer, and it’s only now that I find that they still use TF 1.x! Bad luck!

Theoretically, I could have run Spleeter on Google Colab. But I don’t like Google Colab. Nothing basically wrong with it, except that (i) personally, I think that Jupyter is a great demo tool but a very poor development tool, and (ii) I like my personal development projects unclouded. (I will happily work on a cloud-based project if someone pays me to do it. For that matter, if someone is going to pay me and then also advises me to set up the swap space for my local Ubuntu installation on the cloud, I will happily do it too! But when it comes to my own personal projects, don’t expect me to work on the cloud.)

So, tinkering around with Spleeter was, unfortunately, not possible. But I noticed what they add at the Spleeter’s page: “there are multiple forks exposing spleeter through either a Guided User Interface (GUI) or a standalone free or paying website. Please note that we do not host, maintain or directly support any of these initiatives.” Good enough for me. I mean, the first part…

So, after a couple of searches, I landed at a few sites, and actually tried two of them: [^] and [^]. The song I tried was the original version of the same song I ran the last time (links to YouTube given above). I went for just two stems in both cases—vocals and the rest.

Ummm…. Not quite up to what I had naively expected. Neither software did too well, IMO. Not with this song.

3. Peculiarity of the Indian music:

The thing is this: Indian music—rather, Indian film (and similar) music—tends to be very Indian in its tastes.

We Indians love curries: everything imaginable thrown together. No 5 or 7 or 9 or 11 course dinners for us. All our courses are packed into a big simultaneous serving called “thaali”. Curries are a very convenient device in accomplishing this goal of serving the whole kitchen together. (BTW, talking of Western dinners, it seems to me that they always have an odd number of courses. Why so? Why not an even number of courses? Is it something like that 13 thing?)

Another thing. Traditionally, Indian music has made use of many innovations, but ideas of orchestra and harmony have never been there. Also, Indian music tends to emphasize continuity of sounds, whereas Western music seems to come in a “choppy” sort of style. In Western music, there are clear-cut demarcations at every level, starting from very neatly splitting a symphony into movements, right down to individual phrases and the smallest pieces of variations. Contrast it all with the utter fluidity and flexibility with which Indian music (especially the vocal classical type) gets rendered.

On both counts, Indian music comes across as a very fluid and continuous kind of a thing: a curry, really speaking.

All individual spices (“masaalaa” items) are not just thrown together, they are pounded and ground together into a homogeneous ball first. That’s why, given a good home-made curry, it is tough to even take a good guess at exactly what all spices might have gone into making it. …Yes, even sub-regional variations are hard to figure out, even for expert noses. Just ask a lady from North Maharashtra or Marathwada to name all the spices used in a home-made curry from the north Konkan area of Maharashtra state (which is different from the Malavani sub-region of Konkan), going just by the taste. These days, communications have improved radically, and so, people know the ingredients already. But when I was young, women of my mother’s age would typically fail to guess the ingredients right. So, curries come with everything utterly mixed-up together. The whole here is not just greater than the sum of its parts; the whole here is one whose parts cannot even be teased out all that easily.

In India, our songs too are similar to curries. Even when we use Western instruments and orchestration ideas, the usage patterns still tend to become curry-like. Which means, Indian songs are not at all suitable for automatic/mechanically conducted analysis!

That’s why, the results given by the two services were not very surprising. So, I let it go at that.

But the topic wasn’t prepared to let me go so easily. It kept tugging at my mind.

4. Further searches:

Today, I gave in, and did some further searches. I found Spleeter’s demo video [^]. Of course, there could be other, better demos too. But that’s what I found and pursued. I also found a test of Spleeter done on the Beatles’ song “Help” [^].

Finally, I also found this video which explains how to remove vocals from a song using Audacity [^]. Skip it if you wish, but it was this video which mentioned Melodyne [^], which was a new thing to me. Audacity is Open Source, whereas Melodyne is a commercial product.
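For a feel of why simple, non-AI vocal removal gives mixed results, here is a minimal sketch of one classic trick: centre-channel cancellation. This is an assumption on my part; the video may well use a different method. The idea is that a lead vocal is usually mixed identically into both stereo channels, so subtracting one channel from the other cancels it. The function name `cancel_center` and the synthetic “vocal” and “guitar” signals are hypothetical, made up purely for illustration.

```python
import math


def cancel_center(left, right):
    """Subtract the right channel from the left, sample by sample.

    Anything mixed identically into both channels (typically the lead
    vocal) cancels out; whatever differs between the channels survives.
    """
    return [l - r for l, r in zip(left, right)]


# Synthetic stereo mix (hypothetical data, not taken from any real song).
n = 100
vocal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]          # centre-panned
guitar = [0.5 * math.sin(2 * math.pi * 11 * t / n) for t in range(n)]  # left channel only

left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

residual = cancel_center(left, right)

# In this idealized mix, the residual is the guitar alone.
max_error = max(abs(r - g) for r, g in zip(residual, guitar))
```

In a real recording, the reverb on the vocals is rarely identical in both channels, and anything else panned to the centre (often the bass and the kick drum) gets cancelled too. That is one reason why results from such tricks tend to sound “hollow”.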

Further searches later, I also found this video (skip it if you don’t find all these details interesting), using ISSE [^]. Finally, I found this one (and don’t skip it; there’s a wealth of information in it): [^]. Among many things, it also mentions AutoTune [^], a commercial product. Google search suggested AutoTalent as its Open Source alternative; it was written by an MIT prof [^]. I didn’t pursue it a lot, because my focus was on vocals extraction rather than vocals pitch correction.

Soooooo, where does that leave us?

Without getting into all the details, let me just state a few conclusions that I’ve reached…

5. My conclusions as of today:

5.1. Spleeter and similar AI/ML-based techniques need to improve a lot. Directly offering voice-separation services is not likely to take the world by storm.

5.2. Actually, my position is better stated by saying this: I think that directly deploying AI/ML in the way it is being deployed isn’t going to work out at all. Just throwing terabytes of data at the problem isn’t going to solve it. Not because the current ML techniques aren’t very capable, but because music is far too complex. People are aiming for too high-hanging a fruit here.

5.3. At the same time, I also think that in a slightly more distant future, say over a 5–10 year horizon, chances are pretty good that tasks like separating the voice from the instrumental sounds will become acceptably good. Provided, that is, that they pursue the right clues.

6. How the development in this field should progress (IMO):

In this context, IMO, a good clue is this: First of all, don’t start with AI/ML, and don’t pursue automation. Instead, start with a very good idea of what problem(s) are at all to be solved.

In the present context, it means: Try to understand why products like Melodyne and AutoTune have at all been successful—despite there being “so little automation” in them.

My answer: Precisely because these software packages have given so much latitude to the user.

It’s only after understanding the problem to be solved, and the modalities of the current software, that we should come to the question of whether the existing capabilities of such software can at all be enhanced using AI/ML, one feature/aspect at a time.

My answer, in short: Yes, they can (and should) be.

Notice, we don’t start with the AI/ML algorithms and then try to find applications for them. We start with some pieces of good software that have already created certain needs (or expanded them), and are fulfilling them already. Only then do we think of enhancing it—with AI/ML being just a tool. Yes, it’s an enabling technology. But in the end, it’s just a tool to improve other software.

Then, as the next step, consolidate all possible AI-related gains first—doing just enhancements, really speaking. Only then proceed further. In particular, don’t try to automate everything right from the beginning.

IMO, AI/ML techniques simply aren’t so well developed that they can effectively tackle problems of relatively greater conceptual scope, where wide-ranging generalizations get involved in complex ways in performing a task. AI/ML techniques can, and indeed do, excel (and even outperform people), but only in tasks that are very narrowly defined: tasks like identifying handwritten digits, or detecting a few cancerous cells from among hundreds of healthy cells using details of morphology, without ever getting fatigued. Etc.

Sound isolation is not a task very well suited to these algorithms. Not at this stage of development of AI/ML and of sound-related software.

“The human element will always be there,” people love to repeat in the context of AI.

Yes, I am an engineer and I am very much into AI/ML. But when it comes to tasks like sound separation and all, my point is stronger than what people say. IMO, it would be actually stupid to throw away the manual editing aspects.

The human ear is too sensitive an instrument; it takes almost nothing (and almost no time) for most of us to figure out when some sound processing goes bad, or when a reverse-FFT’ed sound begins to feel too shrill at times, or too “hollow” at other times, or plain “helium-throat”-like at still others [^][^].

Gather some 5–10 people in a room and play some songs on a stereo system that is equipped with a good graphic equalizer. If there is no designated DJ, what is the bet that people are just going to fiddle around with the graphic equalizer every time a new song begins? The reason is not just that people want to impress others with their expertise. The fact of the matter is, people are sensitive to even the minutest variations in sound, and they will simply not accept something which does not sound “just right.” Further, there are individual tastes too, as to what precisely is “just right”. That’s why, if one guy increases the bass, someone else is bound to get closer to the graphic equalizer to set it right! It’s going to happen.

That’s why, it’s crucial not to even attempt to “minimize” the human “interference.” Don’t do that. Not for software like these.

Instead, the aim should be to keep that human element right at the forefront, despite using AI/ML.

Ditto, for other similarly complex tasks / domains, like colouring B&W images, generating meaningful passages of text, etc.

That’s what I think. As of today.

7. Guess I also have some ideas for processing of music:

So, yes, I am not at all for directly starting training Deep Learning models with lots of music tracks.

At the same time, I guess, I also have already started generating quite a few ideas regarding these topics: analysis of music, sound separation, which ML technique might work out well and in what respect (for these tasks), what kind of abstract information to make available to the human “operator” and in what form/presentation, etc. …

…You see, some 15+ years ago, I had actually built a small product called “ToneBrush.” It offered real-time visualizations of music using windowed FFT and further processing (something like a spectrogram and the visualizations which by now have become standard in media players like VLC). My product didn’t sell even a single copy! But it was a valuable experience…
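To make “windowed FFT” concrete, here is a minimal sketch of the basic idea. It is not the actual ToneBrush code; the function names, the frame size, the hop size and the test tone are all assumptions chosen for illustration. The signal is cut into overlapping frames, each frame is tapered with a Hann window, and the magnitudes of its frequency bins are computed. A plain O(n²) DFT stands in for a real FFT here, so that the sketch needs nothing beyond the standard library.

```python
import cmath
import math


def hann_window(n):
    """Return a Hann window of length n (tapers the frame edges to zero)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 for one Hann-windowed frame.

    Uses a naive O(n^2) DFT; a real product would use an FFT instead.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann_window(n))]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_size=64, hop=32):
    """Slide a window over the signal; return one magnitude spectrum per frame."""
    return [magnitude_spectrum(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, hop)]


# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
frame_size = 64
samples = [math.sin(2 * math.pi * 1000.0 * t / sample_rate) for t in range(256)]

spec = spectrogram(samples, frame_size=frame_size, hop=32)

# The strongest bin of the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
```

Each row of `spec` is one column of the visualization. Drawing these rows one after another as the music plays gives the familiar scrolling spectrogram. The window trades frequency sharpness for less leakage between bins, which matters a lot for how “clean” such a display looks.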

…Cutting back to the present, all the thinking which I did back then now came back to me once again. … All the same, for the time being, I’m just noting these ideas in my notebook, and otherwise moving this whole topic on to the back-burner. I want to finish my ongoing work on QM first.

One final note, an after-thought, actually: I didn’t say anything about the cocktail party effect. Well, if you don’t know what the effect means, start with the Wiki [^]. As to its relevance: I remember seeing some work (I think it was mentioned at Google’s blog) which tried to separate out each speaker’s voice from the mixed up signals coming from, say, round-table discussion kind of scenarios. However, I am unable to locate it right now. So, let me leave it as “homework” for you!

8. On the QM front:

As to my progress on the QM side (or lack of it): I spotted (and even recalled from memory) quite a few more conceptual issues, and have been working through them. The schedule might get affected a bit, but not a lot. Instead of the 3–4 weeks which I announced 1–2 weeks ago, these additional items add a further couple of weeks or so, but not more. So, instead of August-end, I might be ready by mid-September. Overall, I am quite happy with the way things are progressing in this respect. However, I’ve realized that this work is not like programming. If I work for more than 7–8 hours a day, I get exhausted. When it’s programming and not just notes/theory-building, I can easily go past 12 hours a day, consistently, for a lot longer period (like weeks at a time). So, things are moving more slowly, but quite definitely, and I am happy with the progress so far. Let’s see.

In the meanwhile, of course, thoughts on topics like coloring of B&W pics or sound separation also pass by, and I note them.

OK then, enough is enough. See you after 10–15 days. In the meanwhile, take care, and bye for now…

A song I like:

(Western, Pop) “Rhiannon”
Band: Fleetwood Mac

[I mean this version: [^]. This song has a pretty good melody, and Stevie Nicks’s voice, even if it’s not too smooth and mellifluous, has a certain charm to it, a certain “femininity” as it were. But what I like the most about this song is, once again, its sound-scape taken as a whole. Like many songs of its era, this song too carries a right level of richness in its tonal land-scape—neither too rich nor too sparse/rarefied, but just right. …If I recall it right, surprisingly, the first time I heard this song was not in the COEP hostels, but in the IIT Madras hostels.]

— 2020.08.09 02:43 IST: First published
— 2020.08.09 20:33 IST: Fixed the wrong linkings and the broken links, and added reference to AutoTalent.


Data Science links—1

Oakay… My bookmarks library has grown too big. Time to move at least a few of them to a blog-post. Here they are. … The last one is not on Data Science, but it happens to be the most important one of them all!

On Bayes’ theorem:

Oscar Bonilla. “Visualizing Bayes’ theorem” [^].

Jayesh Thukarul. “Bayes’ Theorem explained” [^].

Victor Powell. “Conditional probability” [^].

Explanations with visualizations:

Victor Powell. “Explained Visually.” [^]

Christopher Olah. Many topics [^]. For instance, see “Calculus on computational graphs: backpropagation” [^].

Fooling the neural network:

Julia Evans. “How to trick a neural network into thinking a panda is a vulture” [^].

Andrej Karpathy. “Breaking linear classifiers on ImageNet” [^].

A. Nguyen, J. Yosinski, and J. Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images” [^]

Melanie Mitchell. “Artificial Intelligence hits the barrier of meaning” [^]

The Most Important link!

Ijad Madisch. “Why I hire scientists, and why you should, too” [^]

A song I like:

(Western, pop) “Billie Jean”
Artist: Michael Jackson

[Back in the ’80s, this song used to get played in the restaurants from the Pune camp area, and also in the cinema halls like West-End, Rahul, Alka, etc. The camp area was so beautiful, back then—also uncrowded, and quiet.

This song would also come floating on the air, while sitting in the evening at the Quark cafe, situated in the middle of all the IITM hostels (next to the skating rink). Some or the other guy would be playing it in a nearby hostel room on one of those stereo systems which would come with those 1 or 2 feet tall “hi-fi” speaker-boxes. Each box typically had three stacked speakers. A combination of a separately sitting sub-woofer with a few small other boxes or a soundbar, so ubiquitous today, had not been invented yet… Back then, Quark was a completely open-air cafe: a small patch of ground surrounded by small trees, and a tiny hexagonal hut, built in RCC, for serving snacks. There were no benches, even, at Quark. People would sit on those small concrete blocks (brought from the civil department where they would come for testing). Deer would roam around nearby. A daring one or two could venture to come forward and eat pizza out of your (fully) extended hand!…

…Anyway, coming back to the song itself, I had completely forgotten it, but got reminded when @curiouswavefn mentioned it in one of his tweets recently. … When I read the tweet, I couldn’t make out that it was this song (apart from Bach’s variations) that he was referring to. I just idly checked out both of them, and then, while listening to it, I suddenly recognized this song. … You see, unlike so many other guys of e-schools of our times, I wouldn’t listen to a lot of Western pop-songs those days (and still don’t). Beatles, ABBA and a few other groups/singers, maybe, also the Western instrumentals (a lot) and the Western classical music (some, but definitely). But somehow, I was never too much into the Western pop songs. … Another thing. The way these Western singers sing, it used to be very, very hard for me to figure out the lyrics back then, and the situation continues mostly the same way even today! So, recognizing a song by its name was simply out of the question….

… Anyway, do check out the links (even if some of them appear to be out of your reach on the first reading), and enjoy the song. … Take care, and bye for now…]


But I shall not ever do promise…

But I shall not ever do promise that I shall not write a blog-post such as my last one! [^]

A song I like:
I am in a fix. There are two songs, both sung by a Pune-based (and fortunately) little-known lady (also, unfortunately, a caste-Brahmin, going by the surname and all, obviously). I will pick one up at random and continue to run the show in here for now. (As it so happens, the other song, I had already run, though I found it worth repeating too. It’s just that I am not going to repeat it right now.) Here is the one I have in mind for today:

(Marathi) “sunyaa sunyaa mehafilit maajhyaa …”
Lyrics: Suresh Bhat (aided in a small measure, I now gather, by Jabbar Patel, and then, finally, by Shanta Shelke)
Music: Hridaynath Mangeshkar
Singer: Devaki Pandit (as recorded in the Sahyaadri Doordarshan studio).

[Personal comments: Yes, I loved the tune of it, right from the first time I heard it. No, when I first heard it, the lyrics simply didn’t make any sense. (Some of them do, now!). No, for whatever my opinion is worth, Lata’s rendering had always sounded a bit too shrill (or “karkashsha”) to me. … To the point that I had come to slot this song down in my list, always. (Ditto, for the theme of the movie in which it appeared—though not the actress or her acting in this movie.) But when I heard this version (just recently), I then began liking this entire song too.

But speaking of other things, yes, I again got rejected today in a job application. Within minutes. For an MNC (probably an American-owned). By an IIT K-trained (highly junior) guy named Jain. (Not sure about his own competence. Don’t have any idea about it—not even after having gone through his LinkedIn profile.)

So, I am completely jobless, anyway.

Stay tuned for further updates. I shall write. On whatever it is about which I want to write.