Google released “XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation” to encourage more research on multilingual learning and on natural language processing (NLP) models that generalise beyond English across many tasks. According to a blog post by Google, XTREME covers 40 diverse languages and includes nine tasks that collectively require reasoning about different levels of syntax or semantics.
The languages in XTREME were chosen to maximise diversity, coverage in existing tasks, and availability of training data. Among them are under-studied languages such as the Dravidian languages Tamil, Telugu, and Malayalam, spoken mainly in southern India, as well as the Niger-Congo languages Swahili and Yoruba, spoken in Africa.
The nine tasks in XTREME cover a range of paradigms, including sentence classification (assigning a sentence to one or more classes) and structured prediction (predicting objects such as entities and parts of speech), as well as sentence retrieval (matching a query sentence against a pool of candidate sentences) and question answering. To be tested on XTREME, models must be pre-trained on multilingual text using objectives that encourage cross-lingual learning and then fine-tuned on task-specific English data. XTREME evaluates these models on zero-shot cross-lingual transfer performance, i.e., on other languages for which no task-specific data was seen. The three-step process, from pre-training to fine-tuning to zero-shot transfer, is shown in the figure below. For tasks where labelled data is available in other languages, XTREME also compares against fine-tuning on in-language data and provides a combined score based on the zero-shot scores across all tasks.
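To make that three-step recipe concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries rather than the official XTREME scripts. The multilingual BERT checkpoint, the XNLI classification task, and Swahili as the zero-shot target language are illustrative assumptions, not choices prescribed by the benchmark.

```python
# Illustrative sketch of the recipe: (1) start from a multilingually pre-trained
# model, (2) fine-tune on English task data only, (3) evaluate zero-shot on a
# language whose task data was never seen. Not the official XTREME code.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "bert-base-multilingual-cased"   # step 1: a multilingually pre-trained model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=3)

def encode(batch):
    # XNLI is a premise/hypothesis classification task with three labels.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

# Step 2: fine-tune on *English* training data only (subsampled to keep the sketch quick).
train_en = (load_dataset("xnli", "en", split="train")
            .shuffle(seed=0).select(range(20_000)).map(encode, batched=True))

# Step 3: zero-shot evaluation on a language never seen during fine-tuning (Swahili here).
test_sw = load_dataset("xnli", "sw", split="test").map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlt_sketch", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
)
trainer.train()
# Zero-shot loss on the Swahili test set; pass compute_metrics to also get accuracy.
print(trainer.evaluate(eval_dataset=test_sw))
```

In the full benchmark this evaluation is repeated for every task and every target language, and those per-language zero-shot scores feed into the combined score mentioned above.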
The Google researchers conducted experiments with several state-of-the-art pre-trained multilingual models: multilingual BERT, a multilingual extension of the popular BERT model; XLM and XLM-R, two larger versions of multilingual BERT trained on even more data; and M4, a massively multilingual machine translation model. All of these models were pre-trained on large amounts of data in around 100 languages, including the 40 languages in the XTREME benchmark. While the models perform fairly well on Indo-European languages, they achieve lower performance on many languages from other families, such as the Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.
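For readers who want to poke at these models directly, public checkpoints roughly corresponding to three of the four are available on the Hugging Face model hub; the identifiers below are assumptions about which released checkpoints match the models named above, and M4 is omitted because no comparable public checkpoint exists. The snippet simply encodes the same Swahili sentence with each model.

```python
# Encode one non-English sentence with public multilingual checkpoints that roughly
# correspond to the models discussed above (mBERT, XLM, XLM-R). The identifiers are
# Hugging Face hub names chosen for illustration, not taken from the paper.
from transformers import AutoModel, AutoTokenizer

checkpoints = {
    "mBERT": "bert-base-multilingual-cased",   # multilingual BERT, ~104 languages
    "XLM":   "xlm-mlm-100-1280",               # XLM trained with masked LM on 100 languages
    "XLM-R": "xlm-roberta-base",               # XLM-R, trained on CommonCrawl in 100 languages
}

text = "Habari ya dunia"   # Swahili for "hello, world"
for name, ckpt in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    inputs = tok(text, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state          # contextual token embeddings
    print(f"{name}: {inputs['input_ids'].shape[1]} subword tokens -> {tuple(hidden.shape)}")
```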
“We find that while models achieve close to human performance on most existing tasks in English, performance is significantly lower for many of the other languages,” wrote Google Research senior software engineer Melvin Johnson and DeepMind scientist Sebastian Ruder in a blog post. “Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer.”
The research paper is published on arXiv.org, a free online resource maintained and operated by Cornell University. The code and data for the XTREME benchmark are available on GitHub, along with examples for running various baselines. A website and instructions for submitting results to a leaderboard are forthcoming.