For the past few weeks, villagers from Karnataka have been reading sentences in their native language, Kannada, into an application as part of a project to build the country’s first AI-based chatbots for tuberculosis. India has around 40 million native Kannada speakers, and Kannada is one of the country’s 22 official languages and among the more than 121 languages spoken by at least 10,000 people in the world’s most populous country. However, only a handful of these languages are covered by natural language processing (NLP), the technology that enables computers to understand text and spoken words. As a result, hundreds of millions of Indians are cut off from useful information and many economic opportunities.

“For AI tools to work for everyone, they also need to cater to people who don’t speak English, French, or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India. “But if we had to collect as much data in Indian languages as went into a large language model like GPT, we’d be waiting another ten years. So, we can create layers on top of generative AI models such as ChatGPT or Llama,” she told the Thomson Reuters Foundation.

Bhashini, the translation system  

The villagers in Karnataka are among thousands of speakers of different Indian languages who are generating and preserving speech data for recently established tech firms like Karya, which builds datasets for companies such as Microsoft and Google to integrate AI into education, healthcare and other services. The Indian government, which aims to deliver more of its services digitally, is also building language datasets through Bhashini, an AI-led language translation system that creates open-source datasets in regional languages for building AI tools.

Bhashini includes a crowdsourcing initiative in which people contribute sentences in different languages, validate text or audio transcribed by others, translate text, and label images. So far, tens of thousands of Indians have contributed to the platform.

Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai, said, “The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism and in the courts.” “But there are many challenges: Indian languages mainly have an oral tradition, electronic records are not plentiful, and there is a lot of code-mixing. Also, collecting data in less common languages is hard and requires a special effort,” he added.

The economic value of speech data  

Of the more than 7,000 languages actively used worldwide, major NLP systems cover fewer than 100, and English is by far the best resourced. ChatGPT, rolled out last year, set off a wave of interest in generative AI, yet it is predominantly trained in English. Likewise, Google’s Bard supports only a few languages besides English, and Amazon’s Alexa can respond to just three non-European languages: Arabic, Hindi and Japanese. Governments and startups are now trying to bridge this gap.

According to Kalika Bali, crowdsourcing can effectively collect speech and language data in India. “Crowdsourcing also helps to capture linguistic, cultural and socio-economic nuances,” Bali said. “But there has to be awareness of gender, ethnic and socio-economic bias, and it has to be done ethically by educating the workers, paying them, and making a specific effort to collect smaller languages,” she added. “Otherwise, it doesn’t scale.”  

Karya co-founder Safiya Husain said the rapid growth of AI has created demand for data in lesser-known languages, including from academics looking to preserve them.

According to Husain, Karya’s workers own a part of the data they generate, so they can earn royalties, and there is potential to build AI products for their communities with that data in areas such as healthcare and farming. “We see huge potential for adding economic value with speech data - an hour of Odia speech data used to cost about $3-$4, now it’s $40,” she added, referring to the language of the eastern state of Odisha.

Reportedly, less than 11% of India’s population speaks English, and much of the population is not comfortable reading and writing. Hence, many AI models in India concentrate on speech and speech recognition.

A few of the projects and tools actively involved in speech translation and other digital services in India include:  

  • Project Vaani: A Google-funded project collecting speech data from about 1 million Indians and open-sourcing it for use in automatic speech recognition and speech-to-speech translation.
  • EkStep Foundation’s AI-powered translation tool: The Bangalore-based EkStep Foundation’s translation tool is deployed in the Supreme Courts of India and Bangladesh.
  • Jugalbandi: An AI-based chatbot launched by the government-backed AI4Bharat that can answer questions on welfare schemes in several Indian languages.
  • Gram Vaani: A social enterprise working with farmers that uses AI-based chatbots to respond to queries about welfare benefits.

Sources of Article

  • https://www.thehindu.com/sci-tech/technology/india-turns-to-ai-to-capture-its-121-languages/article67603538.ece
  • Photo by Sigmund on Unsplash
