Vinayak Abrol is an AI researcher who has worked at IIT Mandi, the Idiap Research Institute, and the University of Oxford. He is currently an Assistant Professor at IIIT-Delhi, focusing on functional approximation theory, harmonic analysis, random matrix theory, and information theory.

Vinayak actively works in theoretical machine learning, deep learning, and speech and audio processing.

INDIAai interviewed Vinayak to hear his perspective on artificial intelligence.

Could you describe your doctoral research area?

As part of my doctoral study, I explored the theory of sparse representations, compressed sensing, and related mathematical tools, along with their applications in speech, audio, and image signal processing. My doctoral thesis presented a generalized mathematical framework for sparse matrix factorization for solving such problems. It applies to a broad range of problems that do not fit directly into conventional supervised or unsupervised approaches to machine learning. As an extension of my PhD work, I am working to advance our understanding of deep learning approaches and to develop a coherent framework for interpreting them using the aforementioned mathematical tools.
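A standard sparse-representation problem of the kind described above is recovering a sparse coefficient vector z from a measurement x = Dz, given an overcomplete dictionary D. A minimal sketch using the classic ISTA (Iterative Shrinkage-Thresholding Algorithm) is shown below; the random dictionary, the regularization weight, and the iteration count are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def ista(x, D, lam=0.1, n_iter=200):
    """Solve min_z 0.5*||x - D z||^2 + lam*||z||_1 by iterative shrinkage-thresholding."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth term's gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)               # gradient of 0.5*||x - D z||^2
        z = z - grad / L                       # gradient descent step
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding (promotes sparsity)
    return z

# Toy setup: a 2-sparse signal measured through a random 20x50 dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
z_true = np.zeros(50)
z_true[3], z_true[17] = 1.5, -2.0
x = D @ z_true

z_hat = ista(x, D, lam=0.05)
```

Even though the system is underdetermined (20 equations, 50 unknowns), the l1 penalty lets ISTA recover the two active atoms; this is the basic mechanism that compressed sensing theory makes precise.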

Can you provide a list of the intriguing research issues you identified during your PhD?

- How to develop robust signal processing algorithms that generalize well and are intuitive to explain.

- How to incorporate the rich knowledge of speech production and auditory perception into our existing data-hungry machine learning models.

- How to understand a given problem from the perspective of a mathematician (fundamentals & theory), an engineer (practical algorithms), and an end-user (accessible solution).

What is the role of an acoustic model in automatic speech recognition (ASR)?

An acoustic model establishes the relationship between the acoustic information in a speech signal and the corresponding linguistic units. Acoustic modelling is thus the first and necessary step in representing how a speaker pronounces individual sound units in a sequence. From this sequence of acoustic representations, the recognizer predicts the probabilistically optimal sequence of units given the speech signal.

What, according to you, is the immediate challenge we have ahead in automatic speech recognition?

Misinterpretation: the speech-to-text output of current ASR systems only approximates the actual spoken content, as the models generally cannot understand the context of the underlying language(s). Slang, acronyms, similar-sounding words, punctuation, and out-of-vocabulary words in code-mixed speech (e.g., Hinglish: Hindi + English) make the task very challenging. Even in controlled environments, productivity is low, since only the right tone, pace, and subconsciously correct grammar make the flow of a conversation effective.

Can you explain the role of AI in acoustics and speech recognition?

In a nutshell, artificial intelligence (AI) deals with learning and decision-making via the acquisition and application of knowledge. Speech/audio technology falls under the broad category of machine learning tools and algorithms that make voice- and audio-enabled tasks possible, such as speech-to-text, text-to-speech, voice biometrics, speech translation, and information retrieval. With AI, such speech technologies, primarily developed independently, will soon touch almost all facets of our day-to-day lives. AI can achieve this by enhancing the relationship between current speech/audio technologies and digital devices. We already see many use cases, such as voice search, dictation, language translation, managing emails or music playlists, ordering food online, and secure banking.

The global speech and voice recognition market is expected to reach USD 28.3 billion by 2026, up from USD 6.9 billion in 2018, with a CAGR of 19.8 per cent over the forecast period. What role will India play in this expansion?

India has a significant role to play if we consider the goal of achieving universal access to digital devices. Billions of people would prefer to access digital content and services in their native language(s). India is a big potential market for technology giants. Its multilinguality poses significant challenges and provides opportunities to develop intuitive and affordable speech technology solutions for the global market. And let us not forget that India is an emerging software superpower, attracting some of the world's largest IT corporations to open research and development centres here.

Can you recommend some critical research articles on speech recognition that have influenced you?

● B. H. Juang and L. R. Rabiner, "Hidden Markov Models for Speech Recognition," Technometrics, vol. 33, no. 3, pp. 251–272, 1991.

● A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," International Conference on Machine Learning (ICML), pp. 369–376, 2006.

● L. Deng, "Achievements and Challenges of Deep Learning: From Speech Analysis and Recognition to Language and Multimodal Processing," Interspeech, 2014.

● K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-Discriminative Training of Deep Neural Networks," Interspeech, 2013.

Can you explain to us the significance of voice biometrics and pathological speech?

Voice biometrics is a technology solution that uses a person's speech sample to identify and validate their identity. Here, the main focus is on the speaker (who spoke) rather than the speech (what was spoken). As a result, voice biometrics will play a pivotal role in hands-free authentication, either as a standalone system or as part of a multi-factor authentication system, in commercial sectors such as finance, telephony, VoIP, healthcare, and hospitality.

Pathological speech processing involves building speech technology that enables intuitive social interactions for people who have difficulties with communication. The study of speech pathology consists of the evaluation and treatment of speech-production disorders affecting phonation, fluency, or intonation. However, apart from detecting pathological or chronic conditions such as Down's syndrome, stuttering, stroke, Alzheimer's, or Parkinson's disease, such technology has potential applications in human-computer interaction; assisted living, e.g., elderly/infant care; identifying mental and behavioural traits; and understanding neuro-physiological change, e.g., due to depression or stress.

Is programming knowledge required to perform AI research? If so, what programming language is needed?

Building AI systems involves describing a solution procedure (an algorithm) for a problem of interest. For a machine to understand and perform such a task, this procedure needs to be expressed in a formal notation called a programming language.

Traditionally, an AI system has combined a declarative language such as LISP or Prolog with an imperative language such as C or Java. Nowadays, Python is popular among the research community because it is a multi-paradigm programming language.
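The multi-paradigm point can be illustrated in a few lines: the same computation written imperatively (C/Java style), functionally (LISP style), and in a more declarative comprehension form, all within Python. The function names are purely illustrative.

```python
# Imperative style: explicit state and an explicit loop.
def squares_imperative(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

# Functional style: higher-order functions and expressions, no mutation.
def squares_functional(xs):
    return list(map(lambda x: x * x, xs))

# Declarative flavour: state *what* is wanted, not how to loop.
def squares_declarative(xs):
    return [x * x for x in xs]

print(squares_imperative([1, 2, 3]))  # [1, 4, 9]
```

All three return the same result; being able to mix these styles in one language (alongside its scientific libraries) is a large part of Python's appeal for AI research.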

What advice would you provide to individuals who wish to pursue a career in artificial intelligence?

Programming and applied mathematics are important to AI.

First, understand the difference between automation (sense and act) and AI (sense, plan/learn, and act).

Second, be strong in your fundamentals, i.e., mathematics, probability & statistics, and logic, to develop solution procedures.

Third, learn to code, and understand why specific algorithms are designed the way they are!

Fourth, narrow down your area of interest, say vision, acoustics, NLP, or finance, and learn to utilize domain knowledge.

Finally, remember that if you fail, the Earth will not stop rotating, nor will your parents stop loving you.

