We take pride in calling ourselves a diverse nation, home to many religions, communities, and languages. The 2011 census recorded around 121 major languages and about 1,599 other languages in India.

We are a fast-growing economy with the second-largest population in the world. This attracts the attention and business interests of global tech giants, IT and software companies, and major multinationals in fashion and consumer goods.

"Our planet is blessed with several languages. In India, we have several languages and dialects; such diversity makes us a better society. As professor Raj Reddy suggested, why not use AI to seamlessly bridge the language barrier?" said PM Modi in his speech at the opening of the Responsible AI for Social Empowerment (RAISE) 2020 summit.

Though English is an official language of India, the challenge is that only about 1% of the Indian population speaks it. The majority of Indians are non-English speakers, using languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and Urdu.

This is the most evident challenge that NLP has to deal with in India. The most fundamental task NLP performs for us is understanding and translating human language into a form that machines can process. Manipulating human language involves various sub-tasks such as understanding words, phrases, idioms, proverbs, connotations, and the basic rules of that language.

Interpreting a language therefore demands extensive knowledge and understanding. India poses a few added challenges for NLP because most Indic languages do not use the Latin alphabet but scripts derived from the Brahmi script.
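You can see this divide directly in Unicode: Latin and Brahmic characters live in entirely different blocks, so an NLP pipeline must first know which script it is reading. Here is a minimal, illustrative sketch of script detection using only Python's standard `unicodedata` module; real systems use far richer language identification, and the majority-vote heuristic here is only for illustration.

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Guess the dominant Unicode script of a text by majority vote.

    Unicode character names begin with the script, e.g.
    "DEVANAGARI LETTER NA" or "LATIN SMALL LETTER A", so the first
    word of the name serves as a crude script label.
    """
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch).split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

print(dominant_script("नमस्ते दुनिया"))  # DEVANAGARI
print(dominant_script("Hello world"))   # LATIN
print(dominant_script("வணக்கம்"))        # TAMIL
```

A tool like this only tells you the script, not the language: Hindi, Marathi, and Sanskrit all use Devanagari, which is one more layer of ambiguity an Indic NLP system has to resolve.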

Let's discuss in detail the challenges we are talking about here.

Text-based applications need a very clear understanding of the language to achieve the closest and most accurate translations. This linguistic understanding includes part-of-speech tagging, lemmatization, phrase extraction, text categorization, entity extraction, topic extraction, and parsing.
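Two of these steps, part-of-speech tagging and lemmatization, can be sketched with a toy lexicon to show the shape of the task. Production systems (spaCy, Stanza, and the like) learn these mappings from large annotated corpora; the tiny lexicon and suffix rules below are purely hypothetical.

```python
# Hypothetical mini lexicon standing in for a trained POS tagger.
POS_LEXICON = {
    "india": "NOUN", "has": "VERB", "many": "ADJ",
    "languages": "NOUN", "spoken": "VERB", "widely": "ADV",
}

def lemmatize(word: str) -> str:
    """Crude suffix-stripping lemmatizer (illustration only)."""
    for suffix, repl in (("ies", "y"), ("sses", "ss"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

def analyze(sentence: str):
    """Return (token, POS tag, lemma) triples; unknown words get tag 'X'."""
    tokens = sentence.lower().split()
    return [(t, POS_LEXICON.get(t, "X"), lemmatize(t)) for t in tokens]

for token, pos, lemma in analyze("India has many languages"):
    print(f"{token:10s} {pos:5s} {lemma}")
```

Even this toy version hints at the difficulty ahead: suffix rules written for English break down for morphologically rich Indic languages, which is exactly why shared, well-annotated datasets matter.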

  • Ambiguity and complexity at various levels - We have all encountered complexity while learning the basics of a language: confusing aspects that took time and practice to master. NLP faces the same challenge when managing ambiguities in words, semantics, syntax, phonology, pronunciation, tonality, emotion, and more. NLP algorithms need to be tuned to overcome these barriers.
  • Attuning to emotions - It is often critical that algorithms recognize emotional language and adapt accordingly. This requires a special focus on emotion-sensitive keywords and including them in the system's vocabulary.
  • Understanding idioms and metaphors - Handling idioms, phrases, and metaphors correctly makes a translation much more accurate and sets the right tone. This comes in especially handy for virtual voice assistants.
  • The anaphora and cataphora problem - Anaphora and cataphora are another barrier in text translation. The issue arises when, during a conversation, we replace a subject with a pronoun or a synonym; the algorithm then has to figure out which pronoun refers to which subject in a complex conversation. This is a grammatical problem all of us encounter at some point in our speaking or writing.
  • Lack of proper documentation - The lack of standard documentation is a barrier for NLP algorithms. At the same time, the presence of many competing style guides and rule books for a language causes a lot of ambiguity, distorting standardization and making it difficult to achieve high quality. The variety of writing and speaking styles adds further difficulty: different grammatical standards are followed in different regions, or people are free to choose their favorite conventions, which changes word order and the way the language is written and spoken.
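The anaphora problem above can be made concrete with a naive "most recent mention" heuristic: resolve each pronoun to the nearest preceding entity. Real coreference resolvers use gender, number, and syntactic constraints learned from data; the hand-picked entity list below is hypothetical, and the sketch exists mainly to show why the naive approach fails.

```python
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
# Hypothetical mini entity list standing in for a named-entity recognizer.
KNOWN_ENTITIES = {"ravi", "sita", "delhi"}

def resolve_pronouns(text: str) -> str:
    """Replace each pronoun with the most recently seen entity mention."""
    resolved, last_entity = [], None
    for raw in text.split():
        word = raw.strip(".,").lower()
        if word in KNOWN_ENTITIES:
            last_entity = word
            resolved.append(raw)
        elif word in PRONOUNS and last_entity is not None:
            resolved.append(last_entity.capitalize())
        else:
            resolved.append(raw)
    return " ".join(resolved)

# The naive heuristic resolves "He" to the most recent entity (Sita),
# not the intended one (Ravi) -- exactly the ambiguity described above.
print(resolve_pronouns("Ravi met Sita. He thanked her."))
```

The wrong answer this sketch produces is the point: without deeper grammatical and world knowledge, "nearest mention" is all a system has, and it is frequently wrong.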

NLP attempts to overcome all the issues mentioned above to provide high-quality speech-to-speech or text-to-speech conversion. The aim is to develop systems programmed with these issues in mind: models that can infer location-specific aspects and adapt accordingly.

We must try sharing datasets on a larger scale to be able to compare styles and approaches within a language or across multiple languages. Another critical step is to standardize and structure written material, and to make sure new documentation follows strict guidelines.


DISCLAIMER

The information provided on this page has been procured through secondary sources. In case you would like to suggest any update, please write to us at support.ai@mail.nasscom.in