Researchers at the Indian Institute of Technology Gandhinagar (IITGN) have introduced Ganga-1B, a large language model (LLM) designed specifically for Hindi. Developed by the Lingo Research Group, the model is the first release from Project Unity, an initiative dedicated to celebrating and leveraging India’s linguistic diversity by creating comprehensive resources for the country’s major languages.

Development of Ganga-1B

The creation of Ganga-1B represents the culmination of nearly 1.5 years of intensive research and development. The model has been meticulously trained on an extensive collection of public-domain Hindi data, which includes news articles, web documents, books, government publications, educational materials, and selected social media conversations. To ensure the dataset's quality and authenticity, native Hindi speakers reviewed the data.

The training data was drawn from openly available sources across various websites. The primary goal was to create a model that processes Hindi text effectively and outperforms existing open-source models for Indian languages, including those with up to 7 billion parameters.

Key Features

  • Developed by Lingo Research Group at IIT Gandhinagar
  • Model Type: Autoregressive Language Model
  • Languages: Bilingual (Primary: Hindi [hi], Secondary: English [en])
  • License: Apache 2.0

Technical Specifications

  • Precision: Float32
  • Context Length: 2,048 tokens
  • Learning Rate: 4e-4
  • Optimizer: AdamW
  • LR Scheduler: Cosine
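
The list above names a learning rate of 4e-4 and a cosine scheduler but gives no further schedule details. As a rough illustration only, the sketch below shows how a cosine schedule could decay from that value, treating 4e-4 as the peak. The warmup length, total step count, and minimum learning rate are hypothetical parameters chosen for the example; the source does not state them, and this is not the team's training code.

```python
import math

PEAK_LR = 4e-4  # learning rate from the specification list, treated here as the peak


def cosine_lr(step, total_steps, warmup_steps=0, min_lr=0.0, peak_lr=PEAK_LR):
    """Learning rate at `step`: optional linear warmup, then cosine decay.

    total_steps, warmup_steps and min_lr are illustrative assumptions,
    not values reported for Ganga-1B.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Half-cosine from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_lr(s, total_steps=1000))
```

In PyTorch, a comparable setup would typically pair `torch.optim.AdamW` with `torch.optim.lr_scheduler.CosineAnnealingLR`; whether the researchers used those exact classes is not stated in the source.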

Model Architecture and Objective: Ganga-1B is a decoder-only transformer model with the following specifications:

  • Layers: 16
  • Attention Heads: 32
  • Embedding Dimension: 2,048
  • Vocabulary Size: 30,000
  • Sliding Window: 512 tokens
  • Intermediate Dimension: 7,168
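
The figures above can be sanity-checked against the "1B" in the model's name. The sketch below estimates the weight count and shows which positions a token can see under a 512-token sliding window. It rests on assumptions the source does not confirm: full multi-head attention with four square projection matrices, a gated feed-forward block with three matrices, no bias terms, and normalization parameters ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GangaConfig:
    # Values copied from the specification lists in the article.
    n_layers: int = 16
    n_heads: int = 32
    d_model: int = 2048
    vocab_size: int = 30_000
    sliding_window: int = 512
    d_ff: int = 7168
    context_length: int = 2048


def estimate_params(cfg, tied_embeddings=False):
    """Rough weight count, ignoring biases and normalization parameters.

    ASSUMPTIONS (not stated in the source): Q, K, V and output projections
    are each d_model x d_model, and the feed-forward block is gated, using
    three d_model x d_ff matrices.
    """
    attn = 4 * cfg.d_model * cfg.d_model
    mlp = 3 * cfg.d_model * cfg.d_ff
    embed = cfg.vocab_size * cfg.d_model
    head = 0 if tied_embeddings else embed  # separate output projection if untied
    return cfg.n_layers * (attn + mlp) + embed + head


def visible_positions(query_pos, window):
    """Positions a query token may attend to under causal sliding-window
    attention. Assumed convention: the window includes the current token."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


if __name__ == "__main__":
    cfg = GangaConfig()
    print("head dimension:", cfg.d_model // cfg.n_heads)
    print(f"untied embeddings: {estimate_params(cfg) / 1e9:.2f}B weights")
    print(f"tied embeddings:   {estimate_params(cfg, tied_embeddings=True) / 1e9:.2f}B weights")
    last = cfg.context_length - 1
    print("positions visible to the last token:", len(visible_positions(last, cfg.sliding_window)))
```

Under these assumptions the estimate lands between roughly 1.03 and 1.10 billion weights, which is consistent with the model's name. A different attention or feed-forward layout would shift the total.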

Performance and Impact

Ganga-1B is reported to outperform existing open-source models for Indian languages, including larger ones, making it a notable advance in AI language technology for Indian languages. That a model of this size can outperform larger models underscores the effectiveness of the training data and methodologies employed by the researchers.

Since its release, Ganga-1B has garnered considerable attention, with over 600 downloads within the first 48 hours. This rapid adoption highlights the demand for high-quality language models in Hindi and the broader Indian language ecosystem.

Broader Applications and Future Developments

The success of Ganga-1B is just the beginning for the Lingo Research Group at IITGN. The team is developing similar AI models for other Indian languages, including Tamil, Telugu, Marathi, Gujarati, and Urdu. These efforts are part of a broader vision to enhance linguistic resources across India’s diverse language landscape.

E-Governance and Education

One promising application of these AI models is e-governance. By creating models that can effectively process and understand regional languages, the researchers aim to improve accessibility and efficiency in government services. This can lead to better communication and service delivery, especially in regions where local languages are predominantly spoken.

The team is also working on an education-focused large language model (LLM) to assist students and teachers. This model aims to provide educational resources, support language learning, and offer personalized, contextually relevant content.

Project Unity: Celebrating Linguistic Diversity

Project Unity, under which Ganga-1B was developed, aims to create a unified and comprehensive linguistic resource for India’s major languages. The initiative seeks both to advance AI technology and to celebrate and preserve India’s rich linguistic heritage. By developing state-of-the-art language models for various Indian languages, Project Unity aspires to bridge linguistic divides and promote inclusivity in technology and communication.

Conclusion

The unveiling of Ganga-1B by IIT Gandhinagar marks a significant milestone in AI and natural language processing for Indian languages. Through dedicated research and innovative methodologies, the Lingo Research Group has created a model that sets new standards for language models in Hindi. With ongoing efforts to expand this technology to other languages and explore its applications in e-governance and education, the future looks promising for AI-driven linguistic advancements in India. Project Unity’s vision of celebrating and harnessing India’s linguistic diversity is well on its way to becoming a reality, with Ganga-1B paving the way for more breakthroughs.

Source: PIB

Source: Mayank

Image source: IIT Gandhinagar

