In a groundbreaking collaboration, IIT Madras, AI4Bharat, and Sarvam AI have unveiled IndicVoices, India's first comprehensive speech dataset.
Designed to represent India's vast linguistic, cultural, and demographic diversity, IndicVoices marks a significant milestone in advancing speech recognition and artificial intelligence (AI) in multilingual contexts. With 12,000 hours of speech data from 22,563 speakers across 208 Indian districts and 22 languages, the initiative is a testament to India's technological and collaborative prowess.
IndicVoices is a dataset of natural and spontaneous speech containing a total of 12,000 hours of read (8%), extempore (76%) and conversational (15%) audio from 22,563 speakers covering 208 Indian districts and 22 languages. Of these 12,000 hours, 3,200 hours have been transcribed, with a median of 122 hours per language, which helps keep representation equitable across languages. An earlier release of the dataset reported 1,639 transcribed hours, with a median of 73 hours per language.
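To make these proportions concrete, the short Python sketch below turns the quoted percentages into approximate hour counts. It uses only the figures stated above; the variable and function names are illustrative, and because the published percentages are rounded (they sum to 99%), the results should be read as estimates rather than official totals.

```python
# Figures quoted in the article; names here are illustrative only.
TOTAL_HOURS = 12_000
TRANSCRIBED_HOURS = 3_200
SHARES_PCT = {"read": 8, "extempore": 76, "conversational": 15}


def hours_by_type(total_hours, shares_pct):
    """Approximate hours per speech type from rounded percentage shares."""
    return {kind: total_hours * pct // 100 for kind, pct in shares_pct.items()}


breakdown = hours_by_type(TOTAL_HOURS, SHARES_PCT)
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS
# The quoted shares are rounded, so they need not sum to exactly 100.
unaccounted_pct = 100 - sum(SHARES_PCT.values())

for kind, hours in breakdown.items():
    print(f"{kind}: about {hours} hours")
print(f"transcribed: {transcribed_fraction:.1%} of all audio")
print(f"share lost to rounding: {unaccounted_pct}%")
```

On these numbers, extempore speech accounts for roughly 9,120 hours, and a little over a quarter of the audio has been transcribed so far.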
The project employed a robust, scalable framework involving 1,893 personnel and a set of purpose-built data collection tools.
The open-source protocols, transcription guidelines, and quality assurance mechanisms developed during this initiative are now available for global adoption, offering a template for large-scale multilingual data collection.
Using IndicVoices, the team developed IndicASR, the first automatic speech recognition (ASR) model supporting all 22 official languages listed in the 8th Schedule of the Indian Constitution. IndicASR demonstrates the transformative potential of AI in bridging linguistic divides and promoting digital inclusion.
IndicVoices and IndicASR are set to catalyze advancements in AI research and development. By releasing the dataset under the permissive CC-BY-4.0 license and the tools under an MIT license, the initiative ensures widespread access for academic, research, and commercial applications.
The dataset is supported by comprehensive resources such as role-play scenarios and domain-specific question repositories, which provide invaluable assets for industries ranging from education to customer service. These innovations have far-reaching implications, enabling natural language processing (NLP) tools to better understand and serve India’s multilingual population.
The project’s success owes much to the support of the Ministry of Electronics and Information Technology (MeitY) under the BHASHINI initiative and grants from Nilekani Philanthropies and the EkStep Foundation. This backing underscores the importance of public-private collaboration in achieving inclusive technological progress.
IndicVoices offers an open-source blueprint for creating speech datasets in other multilingual regions worldwide. Its standardization protocols, centralized tools, and engaging speech collection techniques can be adapted to capture linguistic diversity globally.
IndicVoices is not just a dataset but a movement toward democratizing AI for multilingual societies. By addressing the challenges of linguistic diversity, Sarvam AI, AI4Bharat, and IIT Madras have laid the groundwork for equitable AI-driven solutions. This initiative exemplifies how innovation, inclusivity, and collaboration can reshape the AI landscape, ensuring no voice is left unheard.
To explore the dataset and tools, visit the IndicVoices website.
Sources: IndicVoices, AI4Bharat