AI4Bharat, in partnership with IBM Research India, has introduced MILU (Multi-task Indic Language Understanding Benchmark), an extensive new evaluation benchmark for Indic languages. AI4Bharat started as a collaboration between IIT Madras and Nandan Nilekani's EkStep Foundation.

Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. MILU is a comprehensive evaluation benchmark designed to address this gap. 

Meet MILU

MILU spans eight domains and 42 subjects across 11 Indic languages, reflecting general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering local history, arts, festivals, and laws alongside standard subjects like science and mathematics. 

The team evaluated over 45 LLMs and found that current models struggle with MILU, with GPT-4o achieving the highest average accuracy at 72%. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts & Humanities and Law & Governance compared to general fields like STEM. The team has stated that all code, benchmarks, and artefacts will be publicly available to foster open research.
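Once the artefacts are released, a natural way to consume the benchmark would be through the Hugging Face datasets library. The sketch below is hypothetical: the repository id "ai4bharat/MILU", the language configuration, the split, and the record fields are assumptions for illustration, not details confirmed in the announcement.

    # Hypothetical sketch of loading MILU via Hugging Face `datasets`.
    # Repo id, "Hindi" config, "test" split, and field layout are assumptions.
    from datasets import load_dataset

    milu_hi = load_dataset("ai4bharat/MILU", "Hindi", split="test")

    sample = milu_hi[0]
    # Expected shape of a record: a question, four answer options, the correct
    # option, plus domain/subject metadata used for the domain-wise analysis.
    print(sample)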

Analysis of the models

The team's analysis shows that the tested models perform significantly better in high-resource languages than in low-resource ones, underscoring the need for more robust multilingual strategies. The domain-specific analysis likewise indicates that models do well in general fields such as STEM but struggle with culturally grounded subjects like Arts & Humanities and Law & Governance, pointing to gaps in this knowledge in current models and training datasets.

As LLMs become increasingly pivotal in modern applications, the team hopes that MILU offers a foundational benchmark for developing more inclusive, culturally aware models that perform well across both general and culturally relevant domains.

This work has a few limitations. First, the study is restricted to the top 11 languages because readily available questions are scarce in lower-resource languages, a gap the team aims to address in future work. Second, limited computational resources prevented a thorough evaluation of larger models such as LLAMA-3.1-70B and LLAMA-3.1-405B. Third, the scarcity of questions necessitated translating a portion of the dataset.

Finally, the evaluation relies primarily on the log-likelihood approach, which may yield results that differ from other established methods, such as generation-based evaluation and chain-of-thought (CoT) prompting.
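To make the log-likelihood approach concrete, the sketch below scores each answer option by the total log-probability a causal language model assigns to it given the question, then picks the highest-scoring option. This is a minimal illustration assuming a Hugging Face transformers causal LM; the model name and the sample question are placeholders, not MILU items.

    # Minimal sketch of log-likelihood multiple-choice scoring. Assumes a
    # Hugging Face causal LM; model name and question are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; substitute any causal LM under evaluation
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def option_logprob(question: str, option: str) -> float:
        """Sum of log-probabilities the model assigns to `option` given `question`."""
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # logits[0, i] predicts token i+1, so shift positions by one.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        total = 0.0
        # Sum log-probs over the option tokens only (real harnesses handle
        # tokenizer boundary effects between prompt and option more carefully).
        for pos in range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1):
            total += log_probs[pos, full_ids[0, pos + 1]].item()
        return total

    question = "Q: The harvest festival of Pongal is primarily celebrated in which state?\nA:"
    options = ["Tamil Nadu", "Kerala", "Punjab", "Assam"]
    scores = {opt: option_logprob(question, opt) for opt in options}
    prediction = max(scores, key=scores.get)  # option with the highest log-likelihood
    print(scores, "->", prediction)

Generation-based evaluation or CoT prompting would instead have the model produce an answer letter or a reasoning chain, which is why the two protocols can disagree on the same items.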
