In recent years, the Transformer architecture has gained significant traction and has emerged as the prevailing approach for Natural Language Processing (NLP) tasks such as Machine Translation (MT).

The architecture scales well: increasing the number of model parameters yields better performance across a diverse range of NLP applications, a trend confirmed by several studies. Alongside this push for scale, there is a parallel effort to make these models more efficient and practical for real-world deployment, which means addressing latency, memory usage, and disk-space concerns.

Researchers have actively explored approaches to these challenges, such as pruning components, sharing parameters, and reducing dimensionality. The widely used Transformer architecture consists of several crucial components, among which the Feed Forward Network (FFN) and Attention are particularly significant.

  • Attention - The attention mechanism lets the model capture relationships and dependencies between words in a sentence, regardless of their position. It helps the model determine which parts of the input text are most relevant to the word it is currently processing, and it is essential for understanding the context and connections between words in a sentence.
  • FFN - The FFN applies a non-linear transformation to each input token independently. By performing the same mathematical operations on every word's representation, it adds depth and expressiveness to the model's understanding of each word (a minimal code sketch of both components follows this list).
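For readers who want a concrete picture, here is a minimal PyTorch-style sketch of the two components inside a single encoder layer. The class and parameter names are illustrative defaults, not taken from any particular codebase, and details such as dropout are omitted.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative Transformer encoder layer: attention + FFN (simplified, no dropout)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Self-attention lets every token attend to every other token in the sentence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The FFN transforms each token's representation independently and non-linearly.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention mixes information across positions.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # The FFN is applied position-wise: the same weights act on every token separately.
        return self.norm2(x + self.ffn(x))
```

Attention is the only part of the layer that moves information between positions; the FFN sees one token vector at a time.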

A team of researchers recently focused on the role of the FFN within the Transformer design. They found that, although it is a large component of the model and accounts for a significant share of its parameters, the FFN is highly redundant. As a result, they could reduce the number of parameters without noticeably hurting accuracy, by removing the FFN from the decoder layers and replacing the encoder FFNs with a single FFN shared across all encoder layers.
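A hedged sketch of that idea, in the same simplified PyTorch style as above: one FFN module is constructed once and handed to every encoder layer, so its weights are shared. The class and function names here are illustrative and may differ from the researchers' actual implementation.

```python
import torch.nn as nn

class SharedFFNEncoderLayer(nn.Module):
    """Encoder layer that receives an externally constructed FFN (illustrative sketch)."""
    def __init__(self, shared_ffn, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = shared_ffn  # the same module object is reused by every layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

def build_encoder(num_layers=6, d_model=512, d_ff=2048, n_heads=8):
    # One FFN instance: its parameters are counted once, however many layers reuse it.
    shared_ffn = nn.Sequential(
        nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
    )
    return nn.ModuleList(
        [SharedFFNEncoderLayer(shared_ffn, d_model, n_heads) for _ in range(num_layers)]
    )
```

On the decoder side, the FFN sub-layer is simply dropped, leaving the self-attention and cross-attention sub-layers in place.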

The researchers report the benefits of this strategy as follows.

  • Parameter Reduction: Removing and consolidating the FFN components significantly reduced the model's parameter count.
  • Minimal Accuracy Loss: Despite removing a large number of parameters, the model's accuracy dropped only slightly, which points to a degree of functional redundancy among the FFNs in the encoder and decoder.
  • Restored Capacity: To bring the model back to its original size while potentially improving performance, the researchers widened the hidden dimension of the shared FFN. Compared to the preceding Transformer Big model, this yielded notable gains in both accuracy and inference latency (see the rough parameter arithmetic after this list).
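To make the "widen the shared FFN back to the original size" step concrete, here is a rough back-of-the-envelope calculation using the standard Transformer Big dimensions (d_model = 1024, d_ff = 4096, 6 encoder and 6 decoder layers). The widened dimension chosen below is purely illustrative; the exact value used in the paper is not reproduced here.

```python
# Rough FFN parameter accounting (weights only, biases ignored) for Transformer Big-like dims.
d_model, d_ff, n_enc, n_dec = 1024, 4096, 6, 6

ffn_params = 2 * d_model * d_ff                     # two linear maps per FFN
baseline = (n_enc + n_dec) * ffn_params             # one FFN in every encoder and decoder layer
shared = 1 * ffn_params                             # decoder FFNs removed, one FFN shared by the encoder

print(f"baseline FFN params: {baseline / 1e6:.1f}M")  # ~100.7M
print(f"shared FFN params:   {shared / 1e6:.1f}M")    # ~8.4M

# Widening the shared FFN's hidden size spends the saved budget again (illustrative choice):
d_ff_wide = d_ff * (n_enc + n_dec)
widened = 2 * d_model * d_ff_wide
print(f"widened FFN params:  {widened / 1e6:.1f}M")   # back to ~100.7M, now in one wide FFN
```

The point is that the parameter budget freed by deleting and sharing FFNs can be reinvested in a single, much wider FFN of roughly the original total size.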

Conclusion

Both attention and the FFN are crucial non-embedding parts of the Transformer design. While the FFN transforms each input token independently and non-linearly, attention captures dependencies between words regardless of their position. The researchers in this study investigate the function of the FFN and find that, despite accounting for a large share of the model's parameters, it is highly redundant.

Furthermore, by removing the FFN from the decoder layers and sharing a single FFN across the encoder, the research team can drastically reduce the number of parameters with only a small drop in accuracy. By widening the hidden dimension of the shared FFN, they can restore the model to its original size, improving both accuracy and latency significantly over the original Transformer Big.
