AI4Bharat unveils BhasaAnuvaad, the largest speech translation dataset tailored for Indian languages. This initiative supports India's enormous linguistic diversity by bridging critical gaps in speech translation benchmarks, encompassing 44,400 hours of audio across 13 Indian languages.
The dataset covers Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Odia, Punjabi, Urdu, Assamese, and Nepali, drawing from a combination of public resources, large-scale web scraping, and synthetic data generation. This multifaceted approach provides both diversity and volume, and it addresses India-specific challenges such as code-switching and dialectal variation, which are often overlooked in global datasets.
AI4Bharat has also unveiled Indic-Spontaneous-Synth, a synthetic evaluation set crafted to expose the limitations of existing translation models in handling spontaneous and naturalistic speech. While models trained on datasets like FLEURS demonstrate high performance in controlled settings, they often falter in realistic scenarios characterized by spontaneous speech, contextual nuance, and regional intricacies. Indic-Spontaneous-Synth is intended as a benchmark for testing model robustness and adaptability in practical applications.
The roadmap for Indic-Spontaneous-Synth includes releasing a human-edited version, promising even higher-quality benchmarks for future research and development.
The introduction of BhasaAnuvaad is a transformative milestone in Indian AI, as it significantly elevates the resources available for speech translation and natural language processing (NLP) in Indic languages. The dataset’s size, diversity, and focus on linguistic intricacies make it an invaluable resource for researchers and developers building inclusive AI systems that cater to India’s linguistic mosaic.
Furthermore, AI4Bharat’s vision extends beyond this launch. Plans are underway to expand the dataset, incorporate more languages, and develop a dedicated speech translation model optimized for Indian contexts.
AI4Bharat's collaboration with IBM Research India under The AI Alliance has already resulted in breakthroughs like MILU (Multi-task Indic Language Understanding Benchmark). With 85,000 multiple-choice questions spanning 11 Indian languages and eight domains, MILU exemplifies their focus on creating India-specific AI benchmarks integrating general knowledge and cultural depth.
The release of BhasaAnuvaad signals an exciting era for Indian AI research. It invites academic institutions, startups, and tech giants to collaborate in harnessing this resource for innovative applications. The possibilities are immense, from enabling seamless multilingual communication to powering AI-driven public services.
Researchers can visit the GitHub repository to access the BhasaAnuvaad dataset and Indic-Spontaneous-Synth.
With initiatives like these, AI4Bharat is both advancing AI technology and democratizing access to tools and datasets that help the global AI community embrace linguistic and cultural diversity.
Sources: IndiaAI Dataset platform, BhasaAnuvaad
Image source: Unsplash