Transformers now play a major role in computer vision, having already achieved state-of-the-art results in many natural language processing tasks. Since 2012, CNNs have been the dominant models for vision tasks. Several lines of research have since explored using transformers for vision, both to reduce architectural complexity and to study the scalability and training efficiency of these models.
In 2020, the Vision Transformer (ViT) emerged as a competitive alternative to CNNs. ViT models have been reported to be roughly four times more computationally efficient than comparable CNNs while matching or exceeding their accuracy. Alongside vision transformers, there has also been growing interest in multilayer perceptron (MLP) architectures for computer vision. Until recently, attention in computer vision was focused mostly on CNNs; with the advent of vision transformers, however, dependence on CNNs is no longer mandatory.
ViT models have outperformed CNNs while requiring substantially fewer computational resources for pre-training. Compared with CNNs, however, ViTs have a weaker inductive bias, so they rely more heavily on model regularisation and data augmentation. The transformer architecture was originally designed for text-based tasks; ViT adapts it by representing the input image as a sequence of image patches and directly predicting class labels from that sequence. A ViT model performs well when it is trained on a sufficiently large amount of data, and such models are now widely used for image recognition tasks.
A ViT model splits the image into visual tokens: it divides the image into fixed-size patches and linearly embeds each patch. Positional embeddings are added, and the resulting sequence is fed to a transformer encoder, whose output is used to produce the prediction. The performance of the model depends on factors such as the optimiser, the network depth, and dataset-specific hyperparameters. ViT models are pre-trained on very large datasets, as large as 14M images, before being fine-tuned on the downstream task. The ViT architecture is shown below.
Vision Transformer Architecture
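For readers who prefer code, the pipeline above can be sketched in a few dozen lines of PyTorch. This is an illustrative simplification rather than a reference implementation: the dimensions follow the commonly quoted ViT-Base configuration (16×16 patches, 768-dimensional embeddings, 12 layers, 12 heads), and the class name `SimpleViT` is our own.

```python
# Minimal ViT-style model: patchify, embed, add positions, encode, classify.
# Defaults assume a ViT-Base-like configuration; all names are illustrative.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify + linear embedding in one step via a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable [class] token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # classification head

    def forward(self, images):                    # (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                 # logits from the [class] token

logits = SimpleViT()(torch.randn(2, 3, 224, 224))  # shape: (2, 1000)
```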
ViT models come in different sizes, such as Base and Large, each with a different number of transformer layers and attention heads. The suffix in a variant name denotes the patch size: for instance, ViT-L/16 is the Large model with a 16×16 patch size.
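To make the naming convention concrete, a quick calculation (assuming the standard 224×224 input resolution used by most ViT checkpoints) shows how the patch size determines the token sequence length the encoder sees:

```python
# The "/16" in a name like ViT-L/16 is the patch size, which fixes the
# number of tokens for a given input resolution (224x224 assumed here).
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 14
num_tokens = patches_per_side ** 2 + 1        # 196 patches + 1 [class] token
print(patches_per_side, num_tokens)           # -> 14 197
```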
ViT models are widely used in tasks such as object detection, segmentation, image classification, and action recognition. They are applied in multimodal tasks including visual grounding, visual question answering, and visual reasoning, as well as in video forecasting and activity recognition. Image enhancement, colourisation, and image super-resolution also use these transformer models, and they are increasingly applied to 3D analysis tasks such as point cloud classification and segmentation.
ViT models operate without image-specific inductive biases yet can capture both the local and global features of an image. On large datasets they achieve higher accuracy than comparable CNNs with reduced training time.