Large language models have demonstrated remarkable capabilities across diverse tasks, yet their development has predominantly focused on English and other high-resource languages. This English-centric approach has created a significant technological gap for the billions of speakers of Indian languages. In an attempt to bridge this gap, Sarvam AI, one of the emerging players in India’s Generative AI landscape, developed a new language model called Sarvam-1, which has been specifically trained for Indian languages. 

What is Sarvam-1? 

Sarvam-1 is a 2-billion parameter language model specifically optimized for Indian languages. “Built from the ground up to support ten major Indian languages alongside English, Sarvam-1 demonstrates that careful curation of training data can yield superior performance even with a relatively modest parameter count”, said an official blog post. The initiative is intended to address two critical challenges in Indic language modelling: 

  • Token Efficiency: Existing multilingual models exhibit high token fertility (tokens needed per word) for Indic scripts, often requiring 4 to 8 tokens per word compared to 1.4 for English. Sarvam-1’s tokenizer is claimed to achieve significantly better efficiency, with fertility rates of 1.4 to 2.1 across all supported languages.
  • Data Quality: While web-crawled Indic language data exists, it often lacks depth and quality. Through advanced synthetic data generation techniques, Sarvam AI has developed a high-quality training corpus of 2 trillion tokens, specifically for 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu). 
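Token fertility, as described above, is simply the average number of tokens a tokenizer produces per word. The sketch below illustrates how that ratio can be measured; `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer (it splits words into fixed-size character chunks purely for illustration), not Sarvam-1's actual tokenizer.

```python
def toy_tokenize(text: str) -> list[str]:
    """Hypothetical subword tokenizer: splits each word into chunks
    of up to 3 characters. A real tokenizer (e.g. BPE) would learn
    its subword vocabulary from data."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def fertility(text: str, tokenize) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# 5 words are split into 11 chunks, so fertility is 11 / 5 = 2.2
print(fertility("language models for Indian languages", toy_tokenize))
```

A lower fertility means fewer tokens per word, which translates directly into cheaper training and faster inference for the same text.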

According to the statement, despite its compact size, Sarvam-1 demonstrates exceptional performance across standard benchmarks. “It achieves high accuracy on both knowledge and reasoning tasks, especially in Indic languages, delivering state-of-the-art performance in its class. It also punches above its weight by being competitive with much larger models in most tasks. Concretely, it easily outperforms Gemma-2-2B and Llama-3.2-3B on a variety of standard benchmarks, including MMLU, Arc-Challenge, and IndicGenBench, while achieving similar numbers to Llama 3.1 8B,” the statement added. 

These results are particularly notable given Sarvam-1’s size, which enables 4-6x faster inference compared to larger models while matching or exceeding their performance on Indic language tasks. “This combination of high performance and computational efficiency makes Sarvam-1 particularly well-suited for practical applications, including deployment on edge devices,” the firm stated.
