Microsoft Research launches LongNet: a Transformer variant that can scale sequence length to more than 1 billion tokens without any loss of performance on shorter sequences.

Large language models (LLMs) must handle increasingly long sequences, but existing methods struggle with either computational complexity or model expressivity, which caps the maximum sequence length. To help overcome this barrier, Microsoft has introduced LongNet, a new type of Transformer that can handle sequences of more than one billion tokens without sacrificing performance on shorter ones.

Microsoft LongNet

Microsoft LongNet is a novel Transformer variant proposed by the company in a research paper. Transformer neural networks process sequential data such as text and speech. LLMs such as OpenAI's GPT-4, Meta's LLaMA, and Google's PaLM 2 are built on Transformer models trained on large amounts of text data.

In recent years, scaling neural networks has become a popular trend. Many powerful deep networks are built by increasing depth for greater expressivity, while the hidden dimension is scaled efficiently with sparse MoE models and model-parallelism approaches. Sequence length, the last remaining atomic dimension of a neural network, is therefore desirable to make as long as possible. Lifting the constraint on sequence length brings several advantages. First, it provides models with a large memory and receptive field for interacting with people and their surroundings. Second, longer contexts contain more sophisticated causal chains and reasoning processes, which models can exploit during training.

Shorter dependencies, by contrast, contain more spurious correlations, which is harmful to generalization. Third, longer sequences enable exploring the limits of in-context learning, which could be a paradigm shift for many-shot learning, since an extremely long context may help models alleviate catastrophic forgetting. The fundamental challenge in increasing sequence length is striking the right balance between computational complexity and model expressivity.

RNN-style models

RNN-style models are one way to increase sequence length, but their sequential nature limits parallelization during training, which is critical in long-sequence modelling. State space models have recently gained popularity for sequence modelling: they can operate as a CNN during training and switch to an efficient RNN at test time. They excel at long-range benchmarks but struggle at regular lengths, falling short of the standard set by Transformers.
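To make the CNN-versus-RNN duality of state space models concrete, here is a minimal NumPy sketch of a linear state space model computed both ways; the matrices, sizes, and toy sequence length are arbitrary illustrations, not taken from the LongNet paper or any specific SSM library.

```python
import numpy as np

# Minimal linear state space model: x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k.
# The same model can be unrolled step by step (RNN view) or expressed as a
# 1-D convolution with kernel [D, CB, CAB, CA^2B, ...] (CNN view).

rng = np.random.default_rng(0)
state, length = 4, 16
A = 0.9 * np.eye(state) + 0.01 * rng.standard_normal((state, state))
B = rng.standard_normal((state, 1))
C = rng.standard_normal((1, state))
D = np.zeros((1, 1))
u = rng.standard_normal(length)                # toy input sequence

# RNN view: sequential scan over time steps.
x = np.zeros((state, 1))
y_rnn = []
for k in range(length):
    y_rnn.append((C @ x + D * u[k]).item())
    x = A @ x + B * u[k]

# CNN view: build the convolution kernel once, then apply it in parallel.
kernel = [D.item()]
Ak_B = B.copy()
for _ in range(length - 1):
    kernel.append((C @ Ak_B).item())
    Ak_B = A @ Ak_B
y_cnn = [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(length)]

assert np.allclose(y_rnn, y_cnn)               # both views give the same output
```

The convolution view is what allows parallel training, while the recurrent view gives constant memory per generated token at test time.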

This shortfall is due primarily to limited model expressivity. Another route to scaling sequence length is to reduce the quadratic complexity of the Transformer's self-attention. Applying sliding windows or convolution modules over the attention is a straightforward way to make the complexity nearly linear. However, doing so sacrifices the memory of early tokens, forgetting the prompts at the very beginning of the sequence. Sparse attention reduces computation by sparsifying the attention matrix while retaining the ability to recall distant information.
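As a rough illustration of the sliding-window idea, the PyTorch sketch below restricts scaled dot-product attention to a local band; the `sliding_window_mask` and `local_attention` helpers, the shapes, and the window size are all hypothetical, and this naive version still materializes the full score matrix, whereas efficient implementations compute only the banded block.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may only attend to positions [i - window + 1, i]."""
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]          # how far back each key position is
    return (dist >= 0) & (dist < window)        # causal and within the window

def local_attention(q, k, v, window):
    """Scaled dot-product attention restricted to a sliding window."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = sliding_window_mask(q.shape[-2], window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Illustrative shapes: 8 tokens, dimension 16, window of 4 tokens.
q = k = v = torch.randn(1, 8, 16)
out = local_attention(q, k, v, window=4)
print(out.shape)   # torch.Size([1, 8, 16])
```

With a banded mask like this, the useful work grows with sequence length times window size rather than with the square of the sequence length, but anything further back than the window is simply invisible to the model, which is the drawback the paragraph above describes.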

Innovation

Microsoft LongNet's major innovation is dilated attention, whose attentive field expands exponentially as the distance between tokens grows, lowering the computation complexity to linear in sequence length and the dependency between distant tokens to logarithmic.
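Below is a simplified sketch of the dilated-attention idea for a single (segment length, dilation rate) pair; LongNet itself mixes several such pairs with geometrically growing values and an optimized implementation, so the `dilated_attention` helper and the shapes here are purely illustrative.

```python
import torch

def dense_attention(q, k, v):
    """Standard scaled dot-product attention, used inside each sparsified segment."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def dilated_attention(q, k, v, segment: int, dilation: int):
    """One (segment length, dilation rate) pair of dilated attention.

    The sequence is split into segments of length `segment`; within each
    segment only every `dilation`-th token takes part in attention, so the
    per-segment cost shrinks by roughly dilation**2. The full method stacks
    several pairs, with larger segments paired with larger dilation rates.
    """
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment):
        idx = torch.arange(start, min(start + segment, n), dilation)
        out[:, idx] = dense_attention(q[:, idx], k[:, idx], v[:, idx])
    return out

# Illustrative shapes: one sequence of 16 tokens, dimension 8.
q = k = v = torch.randn(1, 16, 8)
out = dilated_attention(q, k, v, segment=8, dilation=2)
print(out.shape)   # torch.Size([1, 16, 8])
```

Positions skipped by one dilation rate are covered by other (segment, dilation) pairs in the full method, which is how LongNet keeps local detail while still reaching tokens that are far apart.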

According to the research article, LongNet performs well on both long-sequence modelling and general language tasks, and it can be easily incorporated into existing Transformer-based optimization techniques. The study also discusses LongNet's potential for modelling extremely long sequences, such as treating a whole corpus or even the entire Internet as a single sequence.

Benefits

LongNet has various advantages:

  • It offers fast computation: complexity is linear in sequence length, with only a logarithmic dependency between distant tokens.
  • It can serve as a distributed trainer for extremely long sequences.
  • Its dilated attention is simple to incorporate into any existing Transformer-based optimization.

Conclusion

LongNet is still in the research phase, and Shapiro estimates it will be at least a year before we see its capabilities in practice. Nonetheless, with the rapid progress of AI, a massively powerful intelligence may be closer than many experts previously imagined. Some predict that building a superintelligence will take at least 20 years, while others say humanity will never achieve it.
