Making computer programs capable of understanding human language has long been considered a major obstacle to creating truly intelligent machines, despite recent advances in natural language processing (NLP). The potential applications of a machine that can fully grasp human language are almost endless.

The breakthrough in NLP everyone had been waiting for came in October 2018, when Google’s newly released NLP model, BERT (Bidirectional Encoder Representations from Transformers), passed a benchmark English reading-comprehension test designed by the Allen Institute for AI with a human-equivalent score. Experts hailed BERT as a major landmark and a tectonic shift in how NLP models are built. However, its huge size and heavy computing requirements remained a major concern: training the larger version of the model consumed roughly as much electricity as a US household uses in 50 days.

In September 2019, two groups of researchers successfully shrank BERT, a development that could finally make this advanced technology accessible to researchers and developers across the globe. These tiny models can even run on consumer devices such as smartphones.

So, what is BERT?

Despite the major leaps in natural language processing (NLP) over the last two decades, a computer’s ability to grasp human language has lagged far behind the advances made in other artificial intelligence (AI) technologies such as computer vision. BERT’s arrival marked a significant evolution in AI and proved that a model could learn the vagaries of language and apply what it has learned to a variety of specific tasks.

BERT was hailed as “a step toward a lot of still-faraway goals in AI, like technologies that can summarise and synthesise big, messy collections of information to help people make important decisions” by Sam Bowman, a New York University professor who specialises in natural language research. BERT was followed by numerous competitive NLP frameworks such as OpenAI’s GPT-2 and NVIDIA’s MegatronLM, which have produced better results at the cost of significantly higher computing resources.

The major innovation behind BERT was applying the bidirectional training of a Transformer, an attention-based model, to language modelling. Before BERT, NLP models such as ELMo and OpenAI’s GPT relied on unidirectional (or shallowly combined left-to-right and right-to-left) language models to learn general language representations. In BERT, a Transformer encoder reads the entire sequence of words at once, enabling the model to understand the context of a word from the words on both sides of it. As a result, “the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task specific architecture modifications,” write the researchers led by Jacob Devlin in their paper.
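
To make the bidirectional idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library, PyTorch, and the publicly released bert-base-uncased checkpoint (none of which are named in the article): a pre-trained BERT fills in a masked word using the words on both sides of it.

```python
# Minimal sketch: a pre-trained BERT predicts a masked word from context on
# both sides. Assumes the Hugging Face "transformers" library and PyTorch.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The doctor told the patient to take the [MASK] twice a day."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

Because the encoder sees the whole sentence at once, words both before and after the mask influence the prediction, which is exactly the bidirectionality described above.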

BERT’s success rested on enormous amounts of computing power that had not previously been available to neural networks. Many AI researchers point out that “the ideas that drive BERT have been around for years, but they started to work because modern hardware could juggle much larger amounts of data”. At the same time, AI researchers have grown concerned about the energy and computing requirements of these new NLP models, and about AI research concentrating in the hands of a few tech companies. Under-resourced labs, startups, and academic institutions, especially in developing economies, would not have the means to use or develop such computationally expensive models.

Shrinking BERT

In response to these concerns, many researchers have been working on shrinking existing BERT models without losing their capabilities. Finally, in September 2019, two groups of researchers, from Huawei and Google, published papers explaining how they created smaller versions of BERT.

The first paper, from researchers at Huawei, presents a model called TinyBERT, which is 7.5 times smaller and 9.4 times faster than the original while performing language comprehension on par with it. In the second paper, Google researchers compressed the BERT model by a factor of 60, “with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB”.

The miniaturisation of BERT was accomplished through two variations of a technique known as knowledge distillation. Knowledge distillation uses the large AI model you want to shrink to train a much smaller model in its image, in a kind of teacher-student relationship. Typically, researchers feed the same inputs into both models and then tweak the student until its outputs match the teacher’s.
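
As a rough illustration of that generic teacher-student setup, the sketch below trains a small stand-in “student” network to match a larger “teacher” network’s softened output distribution on the same inputs. PyTorch, the toy models, and the hyperparameters are assumptions for illustration; they are not taken from either paper.

```python
# Toy knowledge-distillation loop: the student is nudged until its outputs
# match the teacher's on the same inputs (stand-in models, not BERT itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))  # "large" model
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))    # "small" model
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution

for _ in range(100):                      # toy loop on random data
    x = torch.randn(32, 128)              # the same inputs go to both models
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    # KL divergence between temperature-softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```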

In the case of BERT, the Huawei researchers employed a novel Transformer distillation method designed specifically for knowledge distillation of Transformer-based models. Using this method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT. They also introduced a two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and the task-specific learning stages, ensuring that TinyBERT captures both the general-domain and the task-specific knowledge in BERT.
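
Conceptually, Transformer distillation matches the teacher’s internal behaviour, not just its final outputs. A hedged sketch of that layer-matching idea appears below; the hidden sizes, shapes, and loss weighting are illustrative assumptions, not the TinyBERT authors’ code.

```python
# Sketch of matching one teacher/student Transformer layer pair: the student
# mimics the teacher's attention matrices and hidden states, with a learned
# projection bridging the difference in hidden size.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_hidden, student_hidden = 768, 312   # illustrative BERT-base vs. small-student sizes

# Learned projection that maps student hidden states into the teacher's space.
proj = nn.Linear(student_hidden, teacher_hidden)

def transformer_distillation_loss(teacher_attn, student_attn,
                                  teacher_states, student_states):
    """MSE between attention matrices plus MSE between (projected) hidden states."""
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    hidden_loss = F.mse_loss(proj(student_states), teacher_states)
    return attn_loss + hidden_loss

# Toy tensors standing in for one aligned teacher/student layer pair.
batch, heads, seq = 2, 12, 16
t_attn = torch.rand(batch, heads, seq, seq)
s_attn = torch.rand(batch, heads, seq, seq)
t_states = torch.randn(batch, seq, teacher_hidden)
s_states = torch.randn(batch, seq, student_hidden)

loss = transformer_distillation_loss(t_attn, s_attn, t_states, s_states)
```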

The Google researchers, for their part, employed “a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary” and combined it “with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model.”
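
A loose sketch of the shared-projection idea in that quote, under the assumption that a single projection matrix is reused to map each teacher layer’s weights into the smaller student space, might look like the following. The names, shapes, and loss are hypothetical illustrations, not the Google authors’ formulation.

```python
# Hypothetical sketch: one shared projection matrix is reused across layers to
# map teacher-sized weights into the student's smaller space, and the student
# is penalised for drifting from the projected teacher layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 768, 192

# Shared projection reused for every layer being matched.
down_proj = nn.Parameter(torch.randn(teacher_dim, student_dim) * 0.02)

def layer_transfer_loss(teacher_weight, student_weight):
    """MSE between the student layer and the down-projected teacher layer."""
    projected = down_proj.T @ teacher_weight @ down_proj   # teacher weight mapped to student size
    return F.mse_loss(student_weight, projected)

t_w = torch.randn(teacher_dim, teacher_dim)   # one teacher layer's weight matrix
s_w = torch.randn(student_dim, student_dim)   # corresponding student layer's weight
loss = layer_transfer_loss(t_w, s_w)
```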

The benefits of a shrunken BERT are vast. First of all, it will undoubtedly improve access to state-of-the-art AI. The tiny models will also help bring the best features of BERT to consumer devices such as smartphones. Because these models do not need to send consumer data to the cloud for processing, they can drastically improve the speed, accuracy, and privacy of personal assistants like Siri and Alexa. Smaller BERT models can also power more accurate text prediction and language generation.

The real promise of a tiny BERT lies in helping humans sift through large pools of documents, create summaries, and extract insights, something only humans could do until now. That means advanced smart research assistants serving corporations, law firms, hospitals, banks, and other industries. The key trait of BERT is that it can be fine-tuned efficiently for a wide range of NLP tasks.
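
As a hedged illustration of that fine-tuning trait, the sketch below puts a single classification head on top of a pre-trained encoder and trains it on a couple of toy examples. The Hugging Face transformers library, the example texts, and the labels are assumptions for illustration; the article names no specific tooling.

```python
# Fine-tuning sketch: a pre-trained encoder plus one classification head,
# trained briefly on toy labelled data (assumes Hugging Face transformers).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["the contract looks favourable", "this clause is a serious risk"]
labels = torch.tensor([1, 0])  # toy labels for a hypothetical document-review task

inputs = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy steps; real fine-tuning uses a proper dataset
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```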

For decades, a true artificial intelligence capable of understanding human language remained a dream; now it is only a few steps away.
