AI4Bharat has announced BhasaAnuvaad, a speech translation dataset built for Indian languages. It covers 13 languages and approximately 44,400 hours of audio.
Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 scheduled languages. As a result, AST systems for Indian languages lag far behind those available for high-resource languages like English. In the paper introducing the dataset, the researchers first evaluate widely used AST systems on Indian languages, identifying notable performance gaps and challenges.
Their findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, an essential aspect of everyday communication.
The researchers describe BhasaAnuvaad as the largest publicly available AST dataset for Indian languages. It covers 13 of the 22 scheduled Indian languages plus English, and spans over 44,400 hours of audio and 17M text segments.
BhasaAnuvaad contains data for English speech to Indic text and Indic speech to English text. This dataset comprises three key categories: (1) Curated datasets from existing resources, (2) Large-scale web mining, and (3) Synthetic data generation.
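To make the two translation directions and three source categories concrete, here is a minimal Python sketch of how such segments might be represented and tallied. The field names, language codes, and the `hours_by_group` helper are hypothetical and chosen only for illustration; they are not taken from the dataset's actual schema or release format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    EN_SPEECH_TO_INDIC_TEXT = "en_speech->indic_text"
    INDIC_SPEECH_TO_EN_TEXT = "indic_speech->en_text"


class Source(Enum):
    CURATED = "curated"        # (1) curated from existing resources
    WEB_MINED = "web_mined"    # (2) large-scale web mining
    SYNTHETIC = "synthetic"    # (3) synthetic data generation


@dataclass(frozen=True)
class Segment:
    # Hypothetical record layout, not the dataset's real schema.
    audio_path: str
    speech_lang: str   # language of the audio, e.g. "hi" or "en"
    text_lang: str     # language of the translated text
    translation: str
    duration_s: float
    source: Source

    @property
    def direction(self) -> Direction:
        # English audio pairs with Indic text; otherwise Indic audio pairs with English text.
        if self.speech_lang == "en":
            return Direction.EN_SPEECH_TO_INDIC_TEXT
        return Direction.INDIC_SPEECH_TO_EN_TEXT


def hours_by_group(segments):
    """Sum audio hours per (direction, source category) pair."""
    totals = Counter()
    for seg in segments:
        totals[(seg.direction, seg.source)] += seg.duration_s / 3600.0
    return dict(totals)
```

A tally of this kind is one way to check how the total audio hours divide across directions and source categories before training or evaluation.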
By offering this diverse and expansive dataset, the researchers aim to bridge the resource gap and promote advancements in AST for Indian languages.
The work evaluated popular speech translation systems for Indian languages and identified key limitations in their ability to handle spontaneous speech, which is common in real-world deployments. The findings underscore the need for more rigorous evaluation and benchmarking of these systems in practical settings.
The researchers introduced a novel benchmark, INDIC-SPONTANEOUS-SYNTH, comprising 120 hours of spontaneous speech data across 13 Indian languages and English, covering diverse domains and tasks to provide a more realistic testbed for system performance.
Source:
Full Study