In a groundbreaking collaboration, IIT Madras, AI4Bharat, and Sarvam AI have unveiled IndicVoices, India's first comprehensive speech dataset.

Designed to represent India's vast linguistic, cultural, and demographic diversity, IndicVoices marks a significant milestone in advancing speech recognition and artificial intelligence (AI) in multilingual contexts. With 12,000 hours of speech data from 22,563 speakers across 208 Indian districts and 22 languages, the initiative is a testament to India's technological and collaborative prowess.

Multilingual Speech Data Collection

IndicVoices is a dataset of natural and spontaneous speech containing 12,000 hours of read (8%), extempore (76%), and conversational (15%) audio from 22,563 speakers covering 208 Indian districts and 22 languages. Of these 12,000 hours, 3,200 hours have been transcribed, with a median of 122 hours per language. An earlier release of the dataset reported 1,639 transcribed hours, with a median of 73 hours per language.
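To put the percentages in concrete terms, the short Python sketch below converts the reported shares into approximate hours per speech type and computes the transcribed fraction. It uses only the figures quoted above; the variable and function names are illustrative, and because the published shares are rounded (they sum to 99%), the results are estimates rather than official counts.

```python
# Figures quoted in the article; names below are illustrative only.
TOTAL_HOURS = 12_000
TRANSCRIBED_HOURS = 3_200
SHARES_PERCENT = {"read": 8, "extempore": 76, "conversational": 15}


def hours_by_type(total_hours, shares_percent):
    """Convert percentage shares into approximate hours per speech type."""
    return {name: total_hours * pct / 100 for name, pct in shares_percent.items()}


breakdown = hours_by_type(TOTAL_HOURS, SHARES_PERCENT)
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS

for name, hours in breakdown.items():
    print(f"{name}: ~{hours:.0f} hours")
print(f"transcribed so far: {transcribed_fraction:.1%} of all audio")
```

By this estimate, extempore speech accounts for roughly 9,120 hours, and a little over a quarter of the audio has been transcribed.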

The project employed a robust, scalable framework involving 1,893 personnel and cutting-edge tools such as:

  • Digital Interaction Prompts: Curated questions and prompts to generate authentic and meaningful speech samples.
  • Mobile Applications: Android-based tools for on-field data collection and real-time verification.
  • Workflow Management Platforms: Web-based systems to streamline transcription and quality control.

The open-source protocols, transcription guidelines, and quality assurance mechanisms developed during this initiative are now available for global adoption, offering a template for large-scale multilingual data collection.

IndicASR: Speech Recognition Model

Using IndicVoices, the team developed IndicASR, the first automatic speech recognition (ASR) model supporting all 22 official languages listed in the 8th Schedule of the Indian Constitution. IndicASR demonstrates the transformative potential of AI in bridging linguistic divides and promoting digital inclusion.

Research and Commercial Applications

IndicVoices and IndicASR are set to catalyze advancements in AI research and development. By releasing the dataset under the permissive CC-BY-4.0 license and the tools under an MIT license, the initiative ensures widespread access for academic, research, and commercial applications.

The dataset is supported by comprehensive resources such as role-play scenarios and domain-specific question repositories, which provide invaluable assets for industries ranging from education to customer service. These innovations have far-reaching implications, enabling natural language processing (NLP) tools to better understand and serve India’s multilingual population.

Generous Support for a Visionary Goal

The project’s success owes much to the support of the Ministry of Electronics and Information Technology (MeitY) under the BHASHINI initiative and grants from Nilekani Philanthropies and the EkStep Foundation. This backing underscores the importance of public-private collaboration in achieving inclusive technological progress.

Global Impact

IndicVoices offers an open-source blueprint for creating speech datasets in other multilingual regions worldwide. Its standardization protocols, centralized tools, and engaging speech collection techniques can be adapted to capture linguistic diversity globally.

Conclusion

IndicVoices is not just a dataset but a movement toward democratizing AI for multilingual societies. By addressing the challenges of linguistic diversity, Sarvam AI, AI4Bharat, and IIT Madras have laid the groundwork for equitable AI-driven solutions. This initiative exemplifies how innovation, inclusivity, and collaboration can reshape the AI landscape, ensuring no voice is left unheard.

To explore the dataset and tools, visit the IndicVoices project page.

Sources: IndicVoices, AI4Bharat

Image source: Unsplash
