Sunit Sivasankaran is an Applied Scientist at Microsoft and a specialist in machine learning and signal processing.

Sunit's work spans speech signal processing and image processing, and he has also built systems for automatic sign language recognition. He currently works on acoustic modelling for automatic speech recognition (ASR).

INDIAai interviewed Sunit Sivasankaran to gain his perspective on AI.

The global speech and voice recognition market is projected to reach USD 28.3 billion by 2026, up from USD 6.9 billion in 2018, a CAGR of 19.8 per cent over the forecast period. What significant changes have you noticed since 2018, and what changes do you expect to see by 2026?

There has been tremendous improvement in speech-related technologies across the spectrum. We can now deploy multi-speaker recognition systems and hope to get sensible transcriptions even with overlapping speech (multiple speakers speaking simultaneously). This technology can transcribe meetings and summarize them; it is almost like having an assistant write out the meeting minutes. We still have some way to go, but the technology is already usable.

With the improvements in speech technology, its applications have also grown. We now see systems that enable a deaf person to watch the news, see a movie or talk to somebody on a mobile phone. Adoption of speech technology by the auto industry has also increased considerably. There have also been drastic improvements in speech synthesis, to the point that it is now almost impossible to distinguish between human speech and machine-synthesized speech (for example, NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality - Speech Research).

Looking ahead, I am excited about technologies such as speech-to-speech translation, which takes input speech in one language and automatically converts it into speech in other languages.

The European Union parliament is already in talks with stakeholders to implement the technology. For a country like ours, with so many languages, speech-to-speech translation has the potential to bring people together and enable better collaboration.

What motivated you to research Audio Source Separation and speech recognition?

I stumbled upon audio source separation by chance, and I am glad I did. I first started research into audio signal processing at the late Prof. K. M. M. Prabhu's lab at IIT Madras during my MS program. I was primarily figuring out how to tell what kind of acoustic space a recording was made in: a forest, an auditorium, a train station and so on. Thanks to Prof. S. Umesh and his group, I slowly realized that this information could help reduce the impact of noise on speech processing algorithms, commonly referred to as speech enhancement in the speech community. Noise is an undesirable yet unavoidable component of any signal, and more so in speech, since it distorts the properties of the speech and impacts the performance of machine learning algorithms.

A natural extension of speech enhancement is speech separation, wherein the noise is simply speech from a different speaker. So, for example, if you are speaking to a voice assistant at home, somebody speaking in the background is just another noise. During my master's, I came across some of the work on audio/speech separation by my now doctoral supervisor, Dr Emmanuel Vincent, at Inria in France.

It was a gradual, unplanned progression into the field, guided by my curiosity and an interesting set of people around me. 

Can you tell us about your PhD research area? Please also share some of the research problems you worked on during your dissertation.

Speech technologies have grown by leaps and bounds over the years. Nevertheless, they still suffer in the presence of noise and reverberation. One kind of noise that is particularly hard to deal with is unwanted speech from another speaker: the target speech and the noise have the same characteristics, and it is hard for machine learning (ML) models to decide which one to remove. For example, if you use a voice-based personal assistant, how does it know to focus on your speech and not on the speaker in a television program running in the background? This area was the focus of my thesis.

This issue can be addressed if you have extra information about the speaker you are interested in. For example, people often use keywords to activate speech assistants, so we know what the speaker said. With this knowledge, we can better estimate the speaker's position in a room (this is referred to as speaker localization in the speech community). Once we know the speaker's position in the room, we can extract the speech spoken by the speaker who uttered the text of interest.
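As an illustration of the speaker localization step, here is a minimal Python sketch of GCC-PHAT, a classical baseline for estimating the time difference of arrival of a speaker's voice between two microphones. It is not the keyword-informed method described above, and the function name and parameters are purely illustrative.

import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    # Estimate the time delay of arrival (in seconds) between two
    # microphone signals using GCC-PHAT (phase transform weighting).
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting keeps only the phase of the cross-power spectrum.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Re-order so that zero delay sits in the middle of the window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

Given the estimated delay tau, the direction of arrival for a two-microphone array with spacing d follows from arcsin(c * tau / d), where c is the speed of sound.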

Another exciting area I explored in my thesis was explainable AI for speech enhancement. In particular, I proposed a metric that measures how many of the "right" components of a speech signal a deep learning model looks at while performing speech enhancement.

What do you see as a significant drawback of speech technologies?

These technologies are quite powerful and can have a significant impact on society. But, like anything this powerful, the effect can be both positive and negative. The positive side of human-sounding text-to-speech synthesis is obvious; the negative is the creation of fake speech, which erodes trust in society. Unfortunately, at the moment, I don't believe we have the technology to prevent the misuse of such a powerful tool.

What is one aspect of speech recognition that researchers have largely not focused on, in your opinion?

Speech recognition has been an active area of research for at least five decades. Many giants in this field have thought about different problems and come up with unique solutions. As a result, it is hard to put a finger on a particular area lacking focus.

Nevertheless, the technologies powering today's growth in speech recognition are data-hungry deep learning models, so the best systems are the ones trained on large amounts of data. Unfortunately, this implies that languages without extensive data tend to have models that perform poorly, and the performance gap with respect to data-rich languages is significant. This is true for many Indian languages as well.

That is not to say there is no focus: there has been a lot of progress over the years, resulting in substantial performance improvements. We need to continue pushing the performance boundaries using cross-lingual data, and find a way to achieve parity with data-rich languages without investing heavily in collecting and transcribing data, which is tedious, expensive and often not feasible.

What, in your opinion, is an essential part of AI research?

As in any research field, we stand on the shoulders of giants. Reading papers helps you figure out which approaches have worked and which have failed. Patience and perseverance are key: ideas you thought should work will fail, often miserably. You have to dust yourself off, learn from it and move on.

Having strong theoretical fundamentals is essential. Programming skills and a good understanding of computer systems are crucial as well.

What advice would you provide to those who aspire to pursue a career in artificial intelligence? What preparations should they make for the transition?

There are two dimensions you must initially consider. One is the basics, such as linear algebra and probability theory; make sure you have a good understanding of them. Once your basics are covered, pick a domain of interest, say speech, images, video or gaming, and get an overview of the field. Books, review papers and tutorials at conferences and workshops are good places to start.

The other dimension is about getting hands-on and coding. I cannot emphasize this enough. Start with a topic that interests you.

Familiarize yourself with toolkits: for example, PyTorch or TensorFlow for deep learning, and ESPnet or Kaldi if you are interested in speech recognition. Once you have a fair understanding, contribute to open-source tools. Most of them have a list of items that need improvement; pick a task you are interested in and make a pull request on GitHub. That will give you valuable feedback and access to people who are experts in their respective fields. Most open-source projects are always on the lookout for contributors, so it is a win-win situation if you contribute.

Another way to get hands-on, especially if you do not have access to data or computing resources, is to start with pre-trained models. There are plenty of pre-trained models available on HuggingFace; pick one and start building applications with it.
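As a minimal sketch of that suggestion, one can load a publicly available pre-trained ASR model through the HuggingFace transformers pipeline; the specific model name and audio file below are placeholders chosen for illustration, not a recommendation.

# Requires: pip install transformers torch
from transformers import pipeline

# Load a publicly available pre-trained speech recognition model.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

# Transcribe a local audio file (the path is a placeholder).
result = asr("sample_utterance.wav")
print(result["text"])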

Could you please list some of the top research publications and books on artificial intelligence?

For linear algebra, the book Introduction to Linear Algebra and the accompanying video lecture series by Prof. Gilbert Strang. For probability theory, there is an excellent NPTEL course by Prof. Krishna Jagannathan from IIT Madras. For speech recognition, Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang. For deep learning, Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville. Finally, for audio source separation, Audio Source Separation and Speech Enhancement by Emmanuel Vincent, Tuomas Virtanen and Sharon Gannot.

Apart from books and publications, podcasts such as Machine Learning Street Talk and following researchers on social media platforms like Twitter have helped me keep up with the latest developments. People often publish pre-prints of their papers on arXiv, and arxiv-sanity is an excellent tool for keeping track of such pre-prints. Following your peers on GitHub can also be helpful.
