As the founder of CognitiveLab, Aditya S Kolavi specialises in developing practical solutions powered by generative AI technology. He is also an AI researcher. CognitiveLab's mission is to leverage the latest advancements in AI to build innovative applications that solve real-world problems. He has experience developing generative AI models and deploying them at scale on cloud platforms like Azure and AWS. He recently developed an LLM leaderboard to standardise the evaluation process for Indic LLMs.

Why are quality datasets important in the development of AI models?

Quality datasets are the most critical factor in the development of AI models, significantly impacting the final model's performance. This is especially true for large language models (LLMs). It is a well-established fact that better data leads to better model performance. Therefore, when training Indic LLMs, the primary focus should be curating or generating high-quality datasets. Good datasets will ultimately benefit the model's output.

Do you think India needs more datasets in Indic languages?

Yes, India needs more datasets in Indic languages. Currently, Indic languages face a significant data scarcity problem. Unless this is addressed, training models could waste computational resources. There is a dire need for good-quality Indic datasets for both training and evaluation to ensure the models' effectiveness and utility.

Can you tell us about your work with AI and what made you interested in the field?

I primarily work on building scalable solutions around generative AI models, such as LLMs, for various use cases. My interest in the field stems from my passion for open source, and I strive to open source as much of my work as possible to benefit the community. Some of my open-source projects include:

1. Released two open-source Indic large language models:
   - Ambari: India's first bilingual Kannada-English LLM, built on top of Meta's Llama 2. It was one of a kind when it was released.
     - Blog post: Introducing Ambari
     - News article: CognitiveLab Unveils Ambari, Bilingual Language Models in Kannada-English
     - Project link: Ambari on Hugging Face
   - Gaja: a Hinglish LLM trained on the latest Llama 3, outperforming Llama 3 itself on many benchmarks.
     - Project link: Gaja on Hugging Face
2. Developed Indic_eval, a lightweight evaluation framework for assessing the performance of Indic LLMs on various benchmarks.
   - Project link: Indic_eval on GitHub
3. Created an LLM leaderboard using Indic_eval, allowing model builders to evaluate and upload scores for comparative analysis.
   - Project link: Indic LLM Leaderboard on Hugging Face
4. Built Tokenizer Arena, a platform that allows users to easily compare different LLM tokenisers.
   - Project link: Tokenizer Arena on Hugging Face

Can you explain the importance of the Indic LLM leaderboard you developed? What makes it relevant in today's AI ecosystem?

The Indic LLM leaderboard, developed alongside the Indic_eval library, is an open-source project aimed at standardising the evaluation process for Indic LLMs. It provides a level playing field to compare these models qualitatively, ensuring that performance assessments are fair and transparent. This is particularly important in today's AI ecosystem, where the quality and performance of language models can vary significantly. It also acts as a foundational layer on top of which other researchers and developers can build.

How can we promote and support the development of more India-centric datasets and AI models?

To promote and support the development of more India-centric datasets and AI models, I believe it is essential to give researchers more resources, such as GPUs, to experiment with. Encouraging more open-source efforts, especially crowd-sourced datasets, would also be beneficial. AI4Bharat is doing an excellent job in this area, and similar initiatives should be supported and expanded.

Are there any similar projects in the pipeline?

Yes, we are experimenting with different architectures to optimise for inference across various use cases. We are also working on visual language models and creating data pipelines to generate synthetic datasets for English and Indic languages.
