An estimated 75% of internet users in India interact with the web in Indian languages. The problem statement here is enabling the next billion users to access Google Maps in the language of their choice. Adding to the challenge, the names of most Indian points of interest (POIs) are not available in native scripts.

In a recent blog post, Google engineers Cibu Johny (Software Engineer, Google Research) and Saumya Dalal (Product Manager, Google Geo) explained the challenge users face. They gave the example of a user searching for KD Hospital in Gujarati script (Gujarati being the sixth-most widely spoken Indian language). Google Maps readily understands the Gujarati word for "hospital", but it does not as easily grasp that "Kay Dee" written in Gujarati script is the equivalent of the Latin-script "KD". The search results are therefore ambiguous, and the user is often shown results farther away than the one being sought.

Wikipedia defines transliteration as a type of conversion of text from one script to another that involves swapping letters. It is not primarily concerned with representing the sounds of the original, but rather with representing the characters accurately and unambiguously.
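To make the idea concrete, here is a minimal sketch of rule-based transliteration using greedy longest-match substitution over a toy Latin-to-Devanagari table. The mapping and the example word are illustrative assumptions, not the production system; real transliterators model phonology, context, and orthographic convention far more carefully.

```python
# A toy Latin-to-Devanagari mapping (an assumption for illustration only).
TOY_MAP = {
    "kh": "ख", "k": "क", "d": "द", "h": "ह",
    "aa": "ा", "a": "",  # the short 'a' vowel is inherent in Devanagari consonants
    "i": "ि", "o": "ो",
}

def transliterate(word: str) -> str:
    """Greedy longest-match substitution: try the longest keys first."""
    keys = sorted(TOY_MAP, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(TOY_MAP[k])
                i += len(k)
                break
        else:
            out.append(word[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

print(transliterate("khadi"))  # -> खदि (toy output; the conventional spelling is खादी)
```

The gap between the toy output and the conventional spelling is exactly the kind of ambiguity that motivates learned models over fixed rules.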

This is the idea Google builds on. The company has built an ensemble of learned models to transliterate the names of Latin-script POIs into ten prominent Indian languages: Hindi, Bangla, Marathi, Telugu, Tamil, Gujarati, Kannada, Malayalam, Punjabi, and Odia. This has increased POI coverage nearly twenty-fold in some languages, bringing in doctors, hospitals, grocery stores, banks, bus stops, train stations, and other essential services. The goal is a system that transliterates a reference Latin-script name into the scripts and orthographies native to the languages above. The point to note here is the difference between translation and transliteration: the latter amounts to writing the same words in a different script. Spelling variability in the Latin script makes it difficult to capture the exact transliteration of many words in Indian languages.

The writers explain, “Candidate transliterations are derived from a pair of sequence-to-sequence (seq2seq) models. One is a finite-state model for general text transliteration, trained like models used by Gboard on-device for transliteration keyboards. The other is a neural long short-term memory (LSTM) model trained, in part, on the publicly released Dakshina dataset. This dataset contains Latin and native script data drawn from Wikipedia in 12 South Asian languages, including all but one of the languages mentioned above. For each native language script, the ensemble makes use of specialized romanization dictionaries of varying provenance that are tailored for place names, proper names, or common words.”  
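The blog does not publish the ensemble's internals, but the description above suggests a candidate-pooling step. Below is a hedged sketch of merging n-best lists from two transliteration models with exact romanization-dictionary hits. The model interfaces are assumptions for illustration, not Google's actual API: each model is a callable returning (native_text, score) pairs.

```python
from collections import defaultdict

def pool_candidates(latin_name, models, rom_dict):
    """Merge candidates for one language, keeping each spelling's best score."""
    pool = defaultdict(float)
    for model in models:
        for text, score in model(latin_name):
            pool[text] = max(pool[text], score)
    # An exact hit in a romanization dictionary tailored to place names
    # gets a strong fixed score (a hypothetical design choice).
    if latin_name.lower() in rom_dict:
        pool[rom_dict[latin_name.lower()]] = 1.0
    return dict(pool)

# Toy usage with stand-in models (made-up Hindi outputs and scores):
fst = lambda name: [("केडी हॉस्पिटल", 0.6), ("के डी हॉस्पिटल", 0.5)]
lstm = lambda name: [("केडी हॉस्पिटल", 0.7)]
print(pool_candidates("KD Hospital", [fst, lstm], {}))
```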

The ensemble is a dynamic model: it continually assigns weights to candidate transliterations based on their frequency of occurrence in a very large online text corpus, with parameters tuned specifically for the task at hand, POI identification. The technology is still being refined, but the early results are promising and mark a substantial expansion of Google Maps usage in India.
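One simple way to realize such frequency-based weighting is to blend each candidate's model score with its relative corpus frequency. The sketch below is a minimal illustration of that idea; the blending formula, the alpha knob, and the counts are all hypothetical.

```python
def rerank(candidates, corpus_counts, alpha=0.5):
    """Blend model score with relative corpus frequency; alpha is an assumed knob."""
    total = sum(corpus_counts.values()) or 1
    return sorted(
        candidates.items(),
        key=lambda kv: alpha * kv[1]
        + (1 - alpha) * corpus_counts.get(kv[0], 0) / total,
        reverse=True,
    )

# Toy usage with made-up Hindi spellings, model scores, and corpus counts:
cands = {"के डी हॉस्पिटल": 0.8, "केडी हॉस्पिटल": 0.7}
counts = {"केडी हॉस्पिटल": 120, "के डी हॉस्पिटल": 30}
print(rerank(cands, counts)[0][0])  # the more frequent spelling wins here
```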
