A language model is a probability distribution over word sequences. Language models are useful for a range of computational linguistics problems, from early applications in speech recognition, where they ensure that nonsensical (i.e. low-probability) word sequences are not predicted, to broader applications such as machine translation. Different language models handle tasks ranging from the very simple to the highly complex.
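To make the idea concrete, the sketch below scores two word sequences with a small pretrained causal language model. It assumes the Hugging Face transformers and PyTorch packages and uses GPT-2 only as a convenient stand-in; the sequence probability is factored with the chain rule, P(w1..wn) = product over t of P(wt | w<t).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                   # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    next_tokens = ids[:, 1:]
    token_log_probs = log_probs.gather(-1, next_tokens.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# A fluent sentence should score higher (less negative) than a nonsensical one.
print(sequence_log_prob("the cat sat on the mat"))
print(sequence_log_prob("mat the on sat cat the"))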

RoBERTa

Language model pretraining has led to significant performance gains, but careful comparison between approaches is difficult. Training is computationally expensive, is often carried out on private datasets of different sizes, and, as the authors show, hyperparameter choices have a significant effect on the final results. The researchers behind RoBERTa (Liu et al., 2019) present a replication study of BERT pretraining that carefully measures the impact of many key hyperparameters and of training data size. They find that BERT was significantly undertrained and, when trained more carefully, can match or exceed the performance of every model published after it. Their best model achieves state-of-the-art results on GLUE, RACE, and SQuAD. These results highlight the importance of previously overlooked design decisions and raise questions about the source of recently reported improvements.
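As a quick illustration of how the released RoBERTa checkpoints are typically used, the sketch below runs the pretrained masked-language-model head through the Hugging Face transformers fill-mask pipeline; this usage example is an assumption of this article, not something described in the paper.

from transformers import pipeline

# RoBERTa's mask token is "<mask>"; the pipeline returns the most likely fillers.
fill_mask = pipeline("fill-mask", model="roberta-base")
sentence = "The goal of language model pretraining is to learn good <mask> representations."
for prediction in fill_mask(sentence):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")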

OPT-175B

As part of Meta AI's commitment to open science, the researchers are sharing Open Pretrained Transformer (OPT-175B), a language model with 175 billion parameters trained on publicly available datasets.

For the first time for a language technology system of this magnitude, the release includes both the pretrained models and the code required to train and use them. To preserve the model's integrity and prevent misuse, Meta is releasing it under a noncommercial licence focused on research applications.

Meta created OPT-175B with energy efficiency in mind, training a model of this scale with only one-seventh of the carbon footprint of GPT-3. This was achieved by combining Meta's open-source Fully Sharded Data Parallel (FSDP) API with NVIDIA's tensor-parallel abstraction within Megatron-LM.
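As a rough illustration of the data-parallel side of that setup, the sketch below wraps a toy model in PyTorch's FSDP API. It is a minimal example (assuming a torchrun launch and an NCCL backend), not OPT's actual training code, and the tiny Sequential model merely stands in for a transformer stack.

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main() -> None:
    # In a real launch this runs once per GPU, e.g. via `torchrun --nproc_per_node=8 train.py`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                 # stand-in for a transformer stack
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # which is how very large models fit into the aggregate GPU memory.
    sharded_model = FSDP(model)
    optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = sharded_model(x).pow(2).mean()        # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()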

XLNet

Carnegie Mellon University and Google researchers created XLNet, a model for NLP tasks such as reading comprehension, text classification, and sentiment analysis. It uses a generalized autoregressive pretraining approach whose formulation enables learning bidirectional contexts, thereby overcoming limitations of BERT.

In addition, XLNet incorporates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. As a result, XLNet empirically outperforms BERT on twenty tasks, often by a large margin, and achieves state-of-the-art results on eighteen tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
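For context on how such a pretrained checkpoint is applied to one of those downstream tasks, the sketch below loads XLNet for sentiment-style sequence classification via the Hugging Face transformers package; this is an illustrative assumption, and the classification head is freshly initialized, so it would still need fine-tuning before its predictions mean anything.

import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
model.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, 2): scores for the two sentiment labels
print(logits.softmax(dim=-1))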

GPT-3: Language Models Are Few-Shot Learners

Recent research has achieved substantial gains on many NLP tasks and benchmarks by pretraining on a vast corpus of text and then fine-tuning on a specific task. While often task-agnostic in architecture, this approach still requires thousands or tens of thousands of task-specific examples for fine-tuning. By contrast, humans can generally perform a new language task from only a few examples or simple instructions, something current NLP systems still largely struggle to do.

In this paper, the researchers demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes approaching parity with prior state-of-the-art fine-tuning approaches. Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, ten times more than any previous non-sparse language model, and evaluate its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning; tasks and few-shot demonstrations are specified purely through text interaction with the model (a minimal prompting sketch follows the list below). GPT-3 delivers strong performance on numerous NLP datasets, such as

  • translation, 
  • question-answering, and 
  • cloze tasks
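The prompting sketch below shows what "specifying a task purely in text" looks like. GPT-3 itself is only available through an API, so this example uses GPT-2 from the Hugging Face transformers package purely to illustrate the prompt format (GPT-2 is far too small to solve the task reliably); the demonstrations follow the style of the paper's English-to-French illustration.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A few demonstrations followed by the query; the model must continue the pattern
# without any gradient updates.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe en peluche\n"
    "mint =>"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))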

In addition, the researchers highlight datasets on which GPT-3's few-shot learning still struggles, as well as datasets on which GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles that human evaluators have difficulty distinguishing from articles written by humans.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often leads to better performance on downstream tasks. At some point, however, further increases become harder because of GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, the researchers describe two parameter-reduction techniques that lower memory consumption and speed up BERT training. Extensive empirical evidence shows that the proposed methods produce models that scale considerably better than the original BERT.
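ALBERT's two techniques are factorized embedding parameterization and cross-layer parameter sharing. The sketch below is a toy illustration of the sharing idea only (not ALBERT's actual code): a single transformer block is reused at every depth, so the encoder's parameter count stops growing with the number of layers.

import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, hidden: int = 768, heads: int = 12, depth: int = 12):
        super().__init__()
        # One set of weights ...
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True
        )
        self.depth = depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ... applied `depth` times, instead of `depth` separate blocks.
        for _ in range(self.depth):
            x = self.shared_block(x)
        return x

encoder = SharedLayerEncoder()
n_params = sum(p.numel() for p in encoder.parameters())
print(f"parameters in the encoder stack: {n_params:,}")  # same as a single layer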

The researchers also employ a self-supervised loss that focuses on modelling inter-sentence coherence and show that it consistently helps downstream tasks with multi-sentence inputs. Consequently, their best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
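That loss is ALBERT's sentence-order prediction objective. The sketch below is a hypothetical illustration (not ALBERT's data pipeline) of how such training pairs can be built: consecutive segments in their original order are positives, and the same segments with the order swapped are negatives.

import random

def sentence_order_pairs(segments: list[str]) -> list[tuple[str, str, int]]:
    pairs = []
    for a, b in zip(segments, segments[1:]):
        if random.random() < 0.5:
            pairs.append((a, b, 1))   # original order -> label 1 (coherent)
        else:
            pairs.append((b, a, 0))   # swapped order  -> label 0 (incoherent)
    return pairs

doc = [
    "The model was pretrained on a large corpus.",
    "It was then fine-tuned on the downstream task.",
    "Finally, it was evaluated on the test set.",
]
for first, second, label in sentence_order_pairs(doc):
    print(label, "|", first, "||", second)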

DistilBERT

DistilBERT has a different goal from the previous models, which seek to maximize BERT's accuracy. While XLNet, RoBERTa, and DeBERTa improved performance, DistilBERT aims to increase inference speed. It sets out to shrink BERT-base and BERT-large, which have 110M and 340M parameters respectively, while retaining as much of their power as feasible. As a result, DistilBERT is 40% smaller than BERT-base and 60% faster while preserving 97% of its language understanding capabilities.
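DistilBERT is built with knowledge distillation, where a small student model is trained to match the output distribution of a larger teacher. The sketch below is a minimal illustration of a standard distillation loss (temperature-scaled KL divergence plus the usual hard-label cross-entropy), not DistilBERT's exact training objective.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for model outputs.
student = torch.randn(4, 30522)   # batch of 4, BERT-sized vocabulary
teacher = torch.randn(4, 30522)
labels = torch.randint(0, 30522, (4,))
print(distillation_loss(student, teacher, labels))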
