Vernacular language models are language models trained specifically on non-standard varieties of a language. These models capture the informal, spoken forms of language commonly used by distinct communities and cultures.

These models seek to capture the vocabulary, grammar, and many linguistic subtleties of everyday speech, which can differ significantly from the standardized language. By prioritizing vernacular varieties, they improve the accuracy and relevance of natural language processing for the people who speak these dialects.

Additionally, they foster inclusivity and contribute to the preservation of cultural diversity. They play a vital role in bridging technology and linguistic variety, helping communities communicate proficiently in their own language and enabling greater participation in the digital realm.

The following are some notable vernacular language models in 2024.

Tamil Llama

Abhinand Balachandra, an ML engineer and Kaggle Master, has developed Tamil Llama, an Indic LLM designed exclusively to enhance the Tamil language domain. The model is built on Meta's Llama 2 as its foundation.

The model was trained with an additional 16,000 Tamil tokens to improve its ability to generate and understand Tamil text. It extends the LLaMA model by adding these Tamil tokens to the vocabulary and employing the LoRA (low-rank adaptation) methodology to keep training efficient.

Four versions are available: Tamil LLaMA 7B, 13B, 7B Instruct, and 13B Instruct. During training, the model's vocabulary was augmented with 16,000 Tamil tokens on top of Llama 2's original 32,000 tokens.
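The vocabulary extension described above can be sketched as follows. This is a minimal, illustrative sketch: the token strings are placeholders, not the real SentencePiece pieces that Tamil LLaMA merges into the base tokenizer.

```python
# Sketch of vocabulary extension: a 32,000-token base vocabulary is
# augmented with 16,000 new Tamil tokens appended after the base IDs.
# Token names below are placeholders for illustration only.

def extend_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Append new tokens after the base vocabulary, skipping duplicates."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab

base = {f"<base_{i}>": i for i in range(32_000)}     # stand-in for Llama 2's vocab
tamil = [f"<ta_{i}>" for i in range(16_000)]         # stand-in for Tamil pieces
extended = extend_vocab(base, tamil)
print(len(extended))  # 48000
```

New tokens receive IDs after the existing range, so the base model's embeddings stay valid and only the appended rows need fresh training.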

Telugu Llama

Ramsri Goutham Golla of Segmind.com is actively developing Telugu Llama by adapting Llama 2, a widely used language model, with a tokenizer far better suited to Telugu text.

According to Golla, the adapted model uses fewer tokens for Telugu than for comparable English text. This advancement guarantees quicker and more economical text generation in Telugu and lays the foundation for Llama 2-based models to excel in Indic languages, transforming the field of natural language processing.
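A rough illustration of why tokenizer adaptation matters so much here (this is not Golla's actual measurement): a tokenizer with no native Telugu pieces falls back to byte-level tokens, so every Telugu character costs roughly three tokens, while common English words are often a single token.

```python
# Worst-case token count for an unadapted tokenizer: with byte-level
# fallback, each token is one UTF-8 byte. Telugu script characters
# occupy 3 bytes each, so unadapted tokenization is very expensive.

def byte_fallback_token_count(text: str) -> int:
    """Token count if the tokenizer must split the text into UTF-8 bytes."""
    return len(text.encode("utf-8"))

english = "hello"
telugu = "నమస్కారం"  # "namaskaram", a greeting: 8 code points

print(byte_fallback_token_count(english))  # 5
print(byte_fallback_token_count(telugu))   # 24
```

Adding native Telugu pieces to the vocabulary collapses such words into one or two tokens, which is what makes generation faster and cheaper.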

Odia Generative AI

The model odia_llama2_7B_v1, developed by Odia Generative AI, is built upon the Llama2-7b architecture and has been refined using a 180k Odia instruction set. This set comprises translated data from open-source sites and a meticulously designed domain knowledge instruction set. The outcome is a proficient model that comprehends Odia instructions and produces responses, showcasing its practical usefulness for the intricacies of the Odia language.

SeaLLMs – Large Language Models for Southeast Asia

Alibaba Group Holding's research division, Damo Academy, has unveiled LLMs tailored explicitly to Southeast Asian languages. SeaLLMs are constructed based on the Llama-2 model and enhanced through ongoing pre-training using an expanded lexicon, customized guidance, and alignment tuning to capture the complexities of regional languages more effectively.

The Southeast Asia LLM (SeaLLM) was pre-trained on Vietnamese, Indonesian, Thai, Malay, Khmer, Lao, Tagalog, and Burmese datasets. It has demonstrated superior performance in linguistic and safety tasks compared with other open-source models. SeaLLMs show exceptional aptitude in language comprehension and generation tasks, posing a significant challenge to prevailing models such as ChatGPT-3.5, particularly in Southeast Asian (SEA) languages.

OpenHathi

India's Sarvam AI has recently launched OpenHathi-Hi-v0.1, the first Hindi LLM in its OpenHathi series. Derived from Llama2-7B, the model performs comparably to GPT-3.5 on Indic languages and is built on a cost-effective platform.

OpenHathi uses an extended version of Llama2-7B's tokenizer, which consists of 48,000 tokens. Its training proceeds in two distinct phases. The first phase aligns the randomly initialized Hindi embeddings; the second, bilingual language modelling, trains the model to attend cross-lingually across Hindi and English tokens.
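The two-phase schedule can be sketched as a choice of which parameter groups are trainable in each phase. This is an assumption-laden illustration, not Sarvam AI's code: the parameter names and the exact freeze/unfreeze split are hypothetical, chosen to match the description above.

```python
# Illustrative two-phase training schedule (hypothetical parameter
# names; the real OpenHathi recipe may differ in detail).

BASE_VOCAB = 32_000       # Llama2-7B's original tokenizer size
EXTENDED_VOCAB = 48_000   # OpenHathi's extended tokenizer size

def trainable_parameters(phase: str) -> set[str]:
    """Return which parameter groups are updated in a given phase."""
    if phase == "embedding_alignment":
        # Phase 1: only the randomly initialized rows for the new Hindi
        # tokens (IDs 32000..47999) are trained; the transformer stays
        # frozen so the new embeddings align with the existing space.
        return {"embed_tokens[32000:48000]", "lm_head[32000:48000]"}
    if phase == "bilingual_lm":
        # Phase 2: bilingual language modelling unfreezes the full model
        # so cross-lingual attention between Hindi and English tokens
        # can be learned end to end.
        return {"embed_tokens", "lm_head", "transformer_layers"}
    raise ValueError(f"unknown phase: {phase}")

print(sorted(trainable_parameters("embedding_alignment")))
```

Freezing the backbone in phase 1 prevents the random new embeddings from disrupting the pretrained weights before they are aligned.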

The model exhibits strong and consistent performance across Hindi tasks, on par with, if not superior to, GPT-3.5, while maintaining competency in English. Sarvam AI's assessment covers practical, real-world assignments as well as conventional Natural Language Generation (NLG) tasks. In comparisons against GPT-3.5 and GPT-4, with GPT-4 as the evaluator, the model demonstrated exceptional proficiency in Hindi in both its native script and Romanized form.
