Transformers now play a major role in computer vision, having first achieved state-of-the-art results in many natural language processing tasks. Since 2012, convolutional neural networks (CNNs) have been the dominant models for vision tasks. Several lines of research have investigated using transformers for vision tasks to reduce architectural complexity and to explore the scalability and training efficiency of these models.

In 2020, the Vision Transformer (ViT) emerged as a competitive alternative to CNNs. ViT models achieved comparable or better accuracy than state-of-the-art CNNs while using roughly four times less compute for training. Alongside vision transformers, there has also been growing interest in multilayer perceptron (MLP)-based architectures for computer vision. Until recently, most attention in computer vision was devoted to CNNs; with the advent of Vision Transformers, however, a dependency on CNNs is no longer mandatory.

ViT models can outperform CNNs while using fewer computational resources for pre-training. Compared with CNNs, however, ViTs have a weaker inductive bias, so they rely more heavily on model regularisation and data augmentation. The Vision Transformer adapts the transformer architecture, originally designed for text, to images: it represents the input image as a sequence of image patches and directly predicts class labels for the image. A ViT model reaches its full potential when trained on a sufficiently large amount of data, and such models are now widely used for image recognition tasks.
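Since pre-trained ViT models are readily available in common libraries, the sketch below illustrates image classification with a pre-trained ViT-B/16 from torchvision. It assumes torchvision 0.13 or later (which provides `vit_b_16` and its ImageNet weights), and the image path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models

# Minimal sketch: classify one image with a pre-trained ViT-B/16.
# Assumes torchvision >= 0.13; "example.jpg" is a placeholder path.
weights = models.ViT_B_16_Weights.IMAGENET1K_V1
model = models.vit_b_16(weights=weights).eval()
preprocess = weights.transforms()          # resizing/normalisation matching the weights

image = Image.open("example.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)     # add a batch dimension: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                  # (1, 1000) ImageNet class scores

top_class = logits.argmax(dim=1).item()
print(weights.meta["categories"][top_class])
```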

How does a ViT model work?

A ViT model splits the input image into visual tokens: it divides the image into fixed-size patches and embeds each patch. Positional embeddings are then added, and the resulting sequence is fed into a transformer encoder to produce the output. The performance of the model depends heavily on factors such as the optimiser, network depth, and dataset-specific hyperparameters. Before fine-tuning, ViT models are pre-trained on very large datasets, containing as many as 14 million images. The overall pipeline is as follows (a code sketch of the first few steps appears after the list):

  1. Split an image into patches (fixed sizes)
  2. Flatten image patches
  3. Create lower-dimensional linear embeddings from the flattened patches and add positional embeddings
  4. Feed the sequence as input to a transformer encoder
  5. Pre-train the ViT model with image labels, fully supervised, on a large dataset
  6. Fine-tune on the downstream dataset for image classification
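To make the encoder input concrete, the following is a minimal PyTorch sketch of steps 1-4. The class name, the default dimensions, and the use of a strided convolution to implement the patch projection are illustrative choices, not any particular library's API.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Steps 1-4: split an image into patches, embed them, add positional embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection to it.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        b = x.shape[0]
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D): one token per patch
        cls = self.cls_token.expand(b, -1, -1)  # learnable classification token
        x = torch.cat([cls, x], dim=1)
        return x + self.pos_embed               # add learnable positional embeddings

# The resulting token sequence is what gets fed to the transformer encoder.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]) -> 196 patch tokens + 1 class token
```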

Vision Transformer Architecture

The ViT model comes in several sizes, such as Base, Large, and Huge, which differ in the number of transformer layers, the hidden dimension, and the number of attention heads. The naming convention encodes the variant and the patch size: for instance, ViT-L/16 denotes the Large model operating on 16×16 pixel patches.
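For reference, the sketch below lists the configurations of the standard variants as reported in the original ViT paper; the dictionary layout itself is just an illustrative way of organising them.

```python
# Standard ViT variants (per the original ViT paper); the dict structure is illustrative.
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, hidden_dim=768,  mlp_dim=3072, heads=12),  # ~86M params
    "ViT-Large": dict(layers=24, hidden_dim=1024, mlp_dim=4096, heads=16),  # ~307M params
    "ViT-Huge":  dict(layers=32, hidden_dim=1280, mlp_dim=5120, heads=16),  # ~632M params
}

# A full model name also includes the patch size, e.g. "ViT-L/16" pairs the
# Large configuration with 16x16 pixel patches.
```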

Use Cases and Applications of a ViT model

ViT models are widely used in tasks such as object detection, segmentation, image classification, and action recognition. They are applied in multimodal tasks including visual grounding, visual question answering, and visual reasoning, as well as in video forecasting and activity recognition. Image enhancement, colourisation, and image super-resolution also use these models, and they are employed in 3D analysis such as point cloud classification and segmentation.

ViT models operate on images without image-specific inductive biases and can capture both the local and global features of an image. When trained on large datasets, they achieve high accuracy with reduced training time.
