Fundamentals of NLP research in Sanskrit


Introduction

An era of rapidly changing technology, with virtualisation, SDN/NFV, the advent of 4G LTE and, in the near future, 5G, is redefining businesses across the globe. The digitalisation wave has brought artificial intelligence and machine learning, and with it a wave of organizations trying to understand the natural language and intent of customers, or visual cues, using deep neural network techniques. We are still far from machines that understand “understanding”, but the first steps in that direction have been taken.

While these steps have moved the needle towards realization, a large part of India's population seems to be unaffected by this change. The “Siris”, “Cortanas” and “Alexas” of the world have captured a global market with their limited natural language capabilities and by building huge language datasets in their clouds; however, about 85% of the Indian population remains untouched by them. The reason is the language used, viz. English.

As we move towards the middle of the 21st century, it is imperative that about 1/8th of the world's population can converse with machines the way the rest of the world does, in its own dialect and in its own way. This is the reason my team and I at Makers Lab (the R&D incubator of Tech Mahindra) are researching one of the first languages of communication spoken by man, Sanskrit.

My intent here is not to belittle any language but to showcase how communication with machines can improve if alternative techniques and idioms are utilized.

Objective

The objective of this paper is to apprise the reader of research in progress. The vision of this research is to ensure that the 23 mother tongues and 1645 dialects spoken in India are also available to every Indian, so that they can seamlessly communicate with the machines of the future.

The objective is also to look through the lens of AI (artificial intelligence) and its algorithms and see which of them apply to NLP (natural language processing) in the world of Sanskrit.

By definition, natural language processing (NLP) is the field, spanning artificial intelligence and linguistics, that deals with communication between computers and human (natural) languages. Being concerned with human-computer interaction, NLP works to enable computers to make sense of human language so that interactions between machines and humans are as user-friendly as possible.

Fundamentals of Sanskrit

I will not try to explain Sanskrit with my limited knowledge here; for the reader, I will instead quote in brief what Wikipedia says about Sanskrit.

Sanskrit is a language of ancient India with a history going back about 3,500 years. It is the primary liturgical language of Hinduism and the predominant language of most works of Hindu philosophy as well as some of the principal texts of Buddhism and Jainism. Sanskrit, in its variants and numerous dialects, was the lingua franca of ancient and medieval India. In the early 1st millennium CE, along with Buddhism and Hinduism, Sanskrit migrated to Southeast Asia, parts of East Asia and Central Asia, emerging as a language of high culture and of local ruling elites in these regions.

Sanskrit is an Old Indo-Aryan language. As one of the oldest documented members of the Indo-European family of languages, Sanskrit holds a prominent position in Indo-European studies. It is related to Greek and Latin, as well as Hittite, Luwian, Old Avestan, and many other extinct languages with historical significance to Europe, West Asia, and Central Asia. It traces its linguistic ancestry to the Proto-Indo-Aryan language, Proto-Indo-Iranian, and the Proto-Indo-European languages.

What makes Sanskrit unique is its rule set and a grammar that was formulated well before the language became widely accepted and spoken in the Indian subcontinent. While most languages we speak are “natural”, Sanskrit is by definition a synthetic language.

Research Evidence

My approach in writing this paper has mostly been to understand various languages from the ground up (starting with their alphabets) and then compare them. It took me days to find a path that would lead to empirical evidence for our objectives, natural language processing in Sanskrit, and to recreating that evidence in software.

Since this is a comparison between two languages, it seems logical to start with how languages were first used, to speak and communicate rather than to write, so let us begin with a discussion of phonetics.

Phonetics

The phonetic sounds of the English alphabets and their counterparts in the Sanskrit Varna-mala are different. Varna-mala is the Sanskrit corpus of alphabets. In English, the sounds of the alphabets clash with their counterparts on quite a few occasions. Let us take the alphabets of English and the Varna-mala of Sanskrit as an example.

In Sanskrit, the alphabets are called the Varna-mala. Every word in Sanskrit is formed by combining two elements, a Swar (स्वर) and a Vyanjan (व्यंजन). In Sanskrit there are 13 Swaras, 33 Vyanjanas and about 2 Swarakrashits {special words}. All in all, Sanskrit has a total of 49 Varnas in the Varna-mala. By its very definition, Sanskrit has more alphabets, characters and building blocks than any other language.

Out of these Swaras, 5 are pure: अ, इ, उ, ऋ, ऌ

The remaining 9 are: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ

To any observer, the difference is quite apparent. The Varnas are arranged primarily according to how air can be shaped within the mouth itself. When you open the mouth and produce a sound from the glottis, the sound is (अ), whereas when the mouth opens wider it becomes (आ). English or any other language, in comparison, has only (A), equivalent to ए, which completely misses how the glottis performs when the mouth is opened wide.

Just from observing the tables above, one thing is visible: the number of English alphabets does not compare to the number of Varnas in Sanskrit. In fact they are far fewer, but on close examination something interesting appears. The range of English vocabulary also becomes narrower, because the phonetics of many alphabets do not map to the individual phonetics of the Sanskrit Varnas. A map based purely on phonetics is shown below.

Conclusion: Some glaring realities emerge:

क is the sound of two alphabets in English, both C and K

Only 18 Varnas are used to describe all 26 English alphabets

X ((क) (ज)): by its nature it is a compound and not a single alphabet, as phonetically the air does not move in the mouth in a way that produces this as one alphabet sound

Z (ज): it is a compound of ज and a dot (ज़)

Algorithmic structure of the language

Each word in Sanskrit can be divided into sub-words, and this division continues until we reach a Dhatu (root). This is similar to a recursive algorithmic structure. Ambiguity that arises while combining or dividing words could be eliminated by fuzzy logic or fuzzy reasoning techniques.
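The recursive division described above can be illustrated with a small sketch. It is only a toy under stated assumptions: the transliterated units in ATOMS are a hypothetical stand-in for a real Dhatu list, sub-words are joined by plain concatenation (real Sanskrit needs sandhi rules), and the function simply returns the first split it finds rather than resolving ambiguity.

```python
from typing import List, Optional, Set

# Hypothetical atomic units standing in for Dhatus; not a real lexicon.
ATOMS: Set[str] = {"raja", "purusha", "deva", "putra"}


def decompose(word: str, atoms: Set[str]) -> Optional[List[str]]:
    """Recursively split `word` until every piece is an atomic unit.

    Returns the first decomposition found, or None if there is none.
    """
    if word in atoms:
        return [word]  # recursion stops at an atomic unit
    for i in range(1, len(word)):
        left_parts = decompose(word[:i], atoms)
        if left_parts is None:
            continue
        right_parts = decompose(word[i:], atoms)
        if right_parts is not None:
            return left_parts + right_parts
    return None


two_parts = decompose("rajapurusha", ATOMS)
three_parts = decompose("devarajaputra", ATOMS)
no_parts = decompose("xyz", ATOMS)
print(two_parts, three_parts, no_parts)
```

A word with several valid splits would need a scoring step on top of this; that is where the fuzzy reasoning mentioned above would come in.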

Solving distributional similarity as the first use case

Distributional similarity is a very common idea in NLP. It emerged from a large number of statistical techniques used in the 1960s and is now accepted as the best way of finding word meaning in a given corpus of English. The idea comes from a basic tenet of linguistics, J.R. Firth's observation that “words are recognized by the company they keep”.

This idea underlies a software technique called Word2Vec. Word2Vec is typically used for NLP in English and converts each word to a vector representation that captures the word's meaning. Two techniques are used, namely skip-gram modelling and continuous bag of words (CBOW). The idea is very simple: given a large piece of text in a language, Word2Vec finds the probability distribution of a context word in relation to a centre word.
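As a rough illustration of what “probability of a context word in relation to a centre word” means, the skip-gram formulation commonly models it as a softmax over dot products of word vectors. The sketch below assumes made-up two-dimensional vectors and a hypothetical helper named context_probabilities; it is not the Word2Vec implementation itself.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def context_probabilities(
    centre_vec: List[float], context_vecs: Dict[str, List[float]]
) -> Dict[str, float]:
    """Softmax over dot products: P(context | centre) for each candidate word."""
    scores = {w: dot(v, centre_vec) for w, v in context_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Made-up vectors, purely for illustration.
centre = [1.0, 0.0]  # pretend this is the vector for "technology"
candidates = {
    "changing": [2.0, 0.0],
    "virtualisation": [1.0, 1.0],
    "globe": [-1.0, 0.5],
}
probs = context_probabilities(centre, candidates)
most_likely = max(probs, key=probs.get)
print(most_likely)
```

Training adjusts the vectors so that words which really do occur near the centre word receive a high probability.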

Let us explain it via an example. Let us assume there is a single paragraph, taken from the top of the page, as shown below.

“An era of rapidly changing technology, virtualisation, SDN/NFV, advent of 4G LTE technology is redefining businesses across the globe. The digitalisation wave has brought artificial intelligence, natural language processing, analytics, and big data into the foray making it more possible for the machines to emulate humans.”

In the Word2Vec technique, the software runs in an unsupervised fashion: each word in turn is chosen as the centre word, a window of size m is chosen around it, and the probability of the words around the centre word is found. The net result is a vector representation of each word, called an “embedding”.

An era of rapidly changing technology, virtualisation

{m-2} {Centre word} {m+2}

Notice that the window size (m) is ours to decide. Following a sliding-window principle, if “technology” is the centre word and m is 2, the two words before it and the two words after it model the distribution of “technology” in the corpus.
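The sliding window can be sketched in a few lines. This is a minimal illustration, not the actual Word2Vec code: the function name skipgram_pairs and the simple token list are assumptions, and the window is truncated at the edges of the text.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], m: int = 2) -> List[Tuple[str, str]]:
    """Return (centre, context) pairs for a window of m words on each side."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - m)                # window is cut short at the start
        stop = min(len(tokens), i + m + 1)   # ... and at the end of the text
        for j in range(start, stop):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


tokens = ["an", "era", "of", "rapidly", "changing",
          "technology", "virtualisation", "sdn/nfv", "advent"]
pairs = skipgram_pairs(tokens, m=2)
technology_context = [ctx for centre, ctx in pairs if centre == "technology"]
print(technology_context)
```

These pairs are the training examples from which the context probabilities, and ultimately the embeddings, are estimated.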

This idea is very powerful, as something magical happens when it is run. Words and their vectors are placed in a space that carries meaning, as shown below. This meaning is provided via the vectors, so the king vector - the man vector + the woman vector automatically yields a vector close to “queen”.
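The vector arithmetic behind this analogy, in its usual form king - man + woman, can be tried on a toy example. The two-dimensional vectors below are invented for illustration (one axis loosely standing for royalty, the other for gender); real embeddings have hundreds of dimensions and are learned from a corpus.

```python
import math
from typing import Dict, List

# Invented toy embeddings: [royalty, gender]. Not learned from any corpus.
EMBEDDINGS: Dict[str, List[float]] = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def analogy(a: str, b: str, c: str, emb: Dict[str, List[float]]) -> str:
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(emb[w], target))


result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)
```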

One has to ask why this technique was used. It is very simple… However, on diving deeper, in English the same word can represent different contexts. Let us take the example of “glasses”.

Let us take some sentences to represent this

“I have glasses.”

The stress in this sentence is on the object, “glasses”. The obvious question is which glasses we are talking about: the pair of lenses used to correct vision, or glass tumblers…? In Sanskrit, these two words are different by construction.

Glass: दर्पण {noun, masculine}, which means a smooth surface, usually made of glass with reflective material painted on the underside, that reflects light so as to give an image of what is in front of it.

Spectacles: उपनेत्र {neuter}, a pair of lenses in a frame that are worn in front of the eyes and are used to correct faulty vision or protect the eyes.

By their construction, the words differ in order to represent different meanings. So the idea that words are islands and convey no meaning on their own is not really valid for Sanskrit-based NLP: words by themselves represent the meaning clearly. From a commercial standpoint, this is useful for the large variety of FAQ-based chatbots in use today. This is, however, just the beginning phase of the research.

Research down the line

With our initial research at Makers Lab, Tech Mahindra, the results obtained are positive. With Sanskrit's algorithmic base, the Word2Vec layer of converting words to vectors can be eliminated. We plan to extend the research (not part of this paper) to the following areas, in which the team is actively involved:

Sequence learning by finding an alternative route to recurrent neural networks (RNNs) in Sanskrit. Sequences in a language are formed according to grammatical rules; from a machine's perspective, an RNN model is trained on a large corpus to learn how words have been used in the language.

Auto-finding Panini's rules: Panini (the grammarian who developed the grammar for Sanskrit) gave a set of about 4,000 rules so that the language is well formed. However, constructing these rules is very difficult, and very few people in the world know them. Our approach is to use a deep reinforcement learning technique to enable a machine to auto-discover those rules. This would let us give the machine core intelligence about how languages are formed, so that tasks like natural language understanding and generation would become trivial for the machine in the environment in which it is placed.

POS tagging for unknown words for Hindi and Marathi

Before we get on to Sanskrit, I wish to draw attention to another piece of research being done at Makers Lab: predicting POS (part-of-speech) tags for unknown words. This paper is presented in IEEE as well.

A POS tagger is a piece of software that reads text in a language and assigns a part of speech to each word, for example noun, verb, adjective and so on. It processes a sequence of words and attaches a part-of-speech tag to each one. In POS tagging, the area of focus is the relationship between adjacent words in a sentence or phrase.

The prediction of unknown words is much the same for Hindi and Marathi because of their similar sentence structure, grammar and so on. They are similar because their root language is the same (Sanskrit). Both Hindi and Marathi are written in the Devanagari script and are considered morphologically rich languages.

The main aim of this research is to predict the POS-tag of a word unknown to the trained model. This is accomplished by applying Naïve Bayes Algorithm and predicting the most likely tag for the unknown word. Our technical contribution to that research can be summarized as follows:

We have presented the existing POS tagging techniques and categorized them as rule-based, AI-based and heuristic.

We have presented a table containing all the Parts of Speech tags for NLTK’s Hindi and Marathi corpus along with their meaning.

We have proposed a fairly simple but effective approach for predicting the POS tag of an unknown word using the Naïve Bayes algorithm. This approach is well suited to POS tagging of languages that have a very limited training corpus.
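To make the idea of predicting a tag for an unknown word with Naïve Bayes concrete, here is a minimal sketch. It is not the method from the paper, whose features are not described here: the choice of features (the previous word's tag and the last two characters of the word), the tag names, the class name NaiveBayesTagger and the tiny transliterated training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, tag) pairs


def extract_features(prev_tag: str, word: str) -> Dict[str, str]:
    # Assumed features: tag of the previous word and a two-character suffix.
    return {"prev_tag": prev_tag, "suffix": word[-2:]}


class NaiveBayesTagger:
    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                       # add-alpha (Laplace) smoothing
        self.tag_counts: Counter = Counter()
        self.feat_counts = defaultdict(Counter)  # (tag, feature) -> value counts
        self.feat_values = defaultdict(set)      # feature -> values seen

    def train(self, sentences: List[Sentence]) -> None:
        for sentence in sentences:
            prev_tag = "<s>"  # sentence-start marker
            for word, tag in sentence:
                self.tag_counts[tag] += 1
                for name, value in extract_features(prev_tag, word).items():
                    self.feat_counts[(tag, name)][value] += 1
                    self.feat_values[name].add(value)
                prev_tag = tag

    def predict(self, prev_tag: str, word: str) -> str:
        total = sum(self.tag_counts.values())
        best_tag, best_score = "", -math.inf
        for tag, count in self.tag_counts.items():
            score = math.log(count / total)  # log prior of the tag
            for name, value in extract_features(prev_tag, word).items():
                seen = self.feat_counts[(tag, name)][value]
                vocab = len(self.feat_values[name]) + 1  # +1 for unseen values
                score += math.log((seen + self.alpha) / (count + self.alpha * vocab))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Tiny hypothetical training set (transliterated, illustrative tags only).
training = [
    [("ladka", "NN"), ("khana", "NN"), ("khata", "VM"), ("hai", "VAUX")],
    [("ladki", "NN"), ("pani", "NN"), ("piti", "VM"), ("hai", "VAUX")],
    [("accha", "JJ"), ("ladka", "NN"), ("sota", "VM"), ("hai", "VAUX")],
]
tagger = NaiveBayesTagger()
tagger.train(training)
verb_guess = tagger.predict("NN", "likhta")   # unseen word after a noun
noun_guess = tagger.predict("<s>", "bacca")   # unseen word at sentence start
print(verb_guess, noun_guess)
```

Because every probability is smoothed, the tagger still returns a sensible tag when both the word and its suffix are unseen, which is what makes this kind of model usable with a very small corpus.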

Conclusion

Human beings evolved at a rapid pace because of the way they could communicate with each other and pass on ideas and messages. One of the oldest well-formed languages is now being relegated to scriptures. Our intent is to ensure that this language becomes the core of how machines understand and relay information, not just for a wide variety of native populations but for the world.
