The language barrier is one of the major challenges in India. Much of India's population speaks languages other than English, such as Bengali, Gujarati, Hindi, Kannada, Marathi, Malayalam, Odia, Punjabi, Tamil, Telugu, and Urdu.

To tackle this challenge and ensure the effective deployment of AI models in India, Soket AI Labs has released the "Bhasha" series, commencing with two significant datasets: "bhasha-wiki" and "bhasha-wiki-indic." These datasets support the development of AI models that are attuned to India's linguistic and cultural nuances. They represent a crucial step forward in the diversification of linguistic resources in computational linguistics. 

The datasets are released as open source. The company aims to foster a collaborative environment in which developers and researchers across India can contribute to, and benefit from, inclusive and contextually aware AI technology.

Understanding bhasha-wiki

This dataset is a corpus of 6.3 million English Wikipedia articles translated into six major Indian languages, yielding 44.1 million articles and over 45.1 billion Indic tokens. It serves as a foundational resource for linguistic and AI research, supporting studies in machine translation, NLP, and language model training.

The dataset has the following characteristics:

  • Extensive lexical volume, with a total size of 117 GiB.
  • Linguistic diversity, with a multilingual framework covering Hindi, Gujarati, Urdu, Tamil, Kannada, Bengali, and English.
  • Translation performed with IndicTrans2.
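The per-language character and word distributions reported for the corpus can be computed with straightforward counting. The sketch below is illustrative only: the sample texts are made up, and whitespace word-splitting is a stand-in for the subword tokenizer actually used to count the corpus's tokens.

```python
def corpus_stats(texts):
    """Return total character and word counts for a list of article texts.

    Word counts use whitespace splitting; the corpus's token counts
    come from a subword tokenizer and would differ.
    """
    chars = sum(len(t) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return {"characters": chars, "words": words}


# Illustrative sample articles per language (not real corpus data).
samples = {
    "Hindi": ["नमस्ते दुनिया", "भारत एक देश है"],
    "English": ["Hello world", "India is a country"],
}

for lang, texts in samples.items():
    print(lang, corpus_stats(texts))
```

Aggregating such counts over each language's split of the corpus would reproduce a distribution chart like the one shown below.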

The image below shows the character, word, and token distribution for each language.

Image source: Soket AI Labs Blog

Understanding bhasha-wiki-indic

The "bhasha-wiki-indic" dataset is a refined subset of "bhasha-wiki", specifically curated to give models a deeper understanding of the Indian context. Content with significant relevance to India was carefully selected, enhancing the dataset's potential for developing culturally resonant AI applications.

To build this dataset, approximately 208,000 contextually relevant articles were identified and extracted across the six Indian languages. The result comprises 200,820 rows with nearly 1.54 billion tokens, providing a rich linguistic base for detailed computational analysis.

Future directions

Soket AI Labs expects these datasets to significantly impact computational linguistics and AI research by providing high-quality, large-scale resources for training models that require a nuanced understanding of Indian languages and contexts. The datasets are open to both academic and commercial use, promoting wide dissemination and application in diverse settings, and they open a space for further scholarly inquiry into cultural specificity in AI technologies.

Sources of Article

Soket Labs Blog

Image: Unsplash
