There is finally another AI system to dethrone OpenAI’s GPT-3 as the largest language model. Known as the Megatron-Turing Natural Language Generation (MT-NLG), this new model has been developed jointly by Microsoft and NVIDIA. It has emerged as the successor to Turing NLG 17B and Megatron-LM.

MT-NLG is an example of what is possible when supercomputers like NVIDIA Selene or Microsoft Azure NDv4 are combined with software breakthroughs such as Megatron-LM and DeepSpeed to train large language AI models.

What is an ML language model, anyway? 

These are large neural networks that make interaction between humans and machines feel more natural. It is thanks to advances in this area that machines can perform NLP tasks such as summarising a long piece of text, translating text across languages or augmenting the customer experience through digital assistants.
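
To make the idea concrete, here is a minimal sketch of such a task using the open-source Hugging Face transformers library. The library and the default summarisation model it downloads are illustrative stand-ins; MT-NLG itself is not distributed this way:

```python
# A minimal sketch of the kind of NLP task a trained language model can perform,
# using the Hugging Face `transformers` library. The default summarisation model
# it downloads is an illustrative stand-in, not MT-NLG.
from transformers import pipeline

summariser = pipeline("summarization")

long_text = (
    "Large language models are neural networks trained on vast amounts of text. "
    "They learn statistical patterns of language and can then be applied to tasks "
    "such as summarisation, translation and question answering."
)

# max_length / min_length bound the length of the generated summary (in tokens).
summary = summariser(long_text, max_length=40, min_length=10)
print(summary[0]["summary_text"])
```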

Language models with more parameters, more data, and more training time acquire a richer, more nuanced understanding of language. It is therefore no surprise that the number of parameters in state-of-the-art NLP models has grown at an exponential rate with each new generation of models.

So what differentiates the MT-NLG?

At 530 billion parameters, it is the largest and most powerful monolithic transformer language model trained to date, with roughly 3x the parameters of the previously largest model, GPT-3 (175 billion). Training a model is an expensive, compute-intensive process: the model learns patterns from data and encodes them as parameters. A greater number of parameters increases accuracy by giving the model more reference points for generating outputs.

Further, MT-NLG has improved upon the prior models in zero-, one-, and few-shot settings and set the new standard for large-scale language models in both model scale and quality. 
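
As a rough illustration of what zero-, one- and few-shot evaluation means in practice, the sketch below builds prompts with a varying number of in-context examples. The task and examples are made up for illustration and are not drawn from the MT-NLG evaluation suite:

```python
# Hypothetical illustration of zero-/one-/few-shot prompting: the model is shown
# k worked examples in the prompt (k = 0, 1 or a few) and must complete the last,
# unanswered query. No weights are updated; the "learning" happens in context.

EXAMPLES = [
    ("The film was a waste of two hours.", "negative"),
    ("An absolute delight from start to finish.", "positive"),
    ("The plot dragged but the acting was superb.", "positive"),
]

def build_prompt(query: str, k: int) -> str:
    """Build a sentiment-classification prompt with k in-context examples."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in EXAMPLES[:k]:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

# k = 0 -> zero-shot, k = 1 -> one-shot, k = 3 -> few-shot
print(build_prompt("I would happily watch it again.", k=3))
```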

But what were the challenges along the way?

There were two major challenges pertaining to GPU memory and training time. With the number of parameters running into billions, it is no longer possible to fit them in the memory of even the largest GPU. Further, the large number of compute operations required can result in unrealistically long training times.
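
To see why a single GPU cannot hold such a model, a back-of-the-envelope calculation helps. The byte counts below follow the commonly used accounting for mixed-precision Adam training (fp16 weights and gradients plus fp32 optimiser states); the exact footprint of MT-NLG's training run will differ:

```python
# Rough memory arithmetic for a 530-billion-parameter model. The 16 bytes/parameter
# figure is the standard estimate for mixed-precision Adam training: 2 (fp16 weights)
# + 2 (fp16 gradients) + 4 (fp32 master weights) + 4 + 4 (fp32 Adam moments).
PARAMS = 530e9
GPU_MEMORY_GB = 80          # one NVIDIA A100 80GB

weights_fp16_tb = PARAMS * 2 / 1e12
training_state_tb = PARAMS * 16 / 1e12

print(f"fp16 weights alone:   ~{weights_fp16_tb:.2f} TB")
print(f"weights + optimiser:  ~{training_state_tb:.2f} TB")
print(f"single A100 capacity:  {GPU_MEMORY_GB / 1000:.2f} TB")
# Roughly 1 TB of weights and 8.5 TB of training state versus 0.08 TB per GPU,
# hence the need to shard the model across hundreds of GPUs.
```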

NVIDIA and Microsoft therefore achieved unprecedented training efficiency by combining a state-of-the-art GPU-accelerated training infrastructure with a cutting-edge distributed-learning software stack built on Megatron-LM and DeepSpeed.
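
This software stack spreads the model across GPUs along several axes at once (tensor, pipeline and data parallelism). The sketch below only tallies how such a split divides up the work; the parallelism degrees are illustrative assumptions, not necessarily the configuration used for MT-NLG:

```python
# Illustrative accounting for 3D parallelism (tensor x pipeline x data).
# The parallelism degrees chosen here are assumptions for the sake of the example.
TOTAL_GPUS = 4480            # 560 DGX A100 servers x 8 GPUs each
TENSOR_PARALLEL = 8          # split each layer's matrices across GPUs in a node
PIPELINE_PARALLEL = 35       # split the stack of layers into sequential stages
DATA_PARALLEL = TOTAL_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)

PARAMS = 530e9
params_per_gpu = PARAMS / (TENSOR_PARALLEL * PIPELINE_PARALLEL)

print(f"data-parallel replicas:  {DATA_PARALLEL}")
print(f"parameters held per GPU: ~{params_per_gpu / 1e9:.1f} B")
# Each GPU holds only its slice of the layers (and of each layer's weights),
# which is what makes training a 530B-parameter model feasible.
```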

Model training is done with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer, powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat-tree configuration. Each DGX A100 has eight NVIDIA A100 80GB Tensor Core GPUs, fully connected to each other by NVLink and NVSwitch. Microsoft uses a similar reference architecture for its Azure NDv4 cloud supercomputers.
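
For a sense of what mixed-precision training looks like at the level of a single GPU, here is a generic PyTorch sketch using automatic mixed precision. This is the standard pattern such systems build on, not the actual Megatron-DeepSpeed training code:

```python
# Generic PyTorch mixed-precision training step (illustrative, not MT-NLG's code).
# Forward/backward run largely in fp16 for speed and memory; a GradScaler keeps
# small gradients from underflowing, and the optimiser updates fp32 master weights.
import torch

model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # ops run in fp16 where safe
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()                  # scale loss to avoid underflow
    scaler.step(optimizer)                         # unscale grads, then update
    scaler.update()                                # adjust the scale factor
    return loss.item()
```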

And what does this success imply?

MT-NLG has pushed the boundaries of Natural Language Processing even further. It can perform natural language tasks with high accuracy, including text prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation. By pushing the state of the art in NLP, this is a big step forward in the journey towards unlocking the full promise of AI in natural language.
