Researchers have developed a machine-learning model for high-resolution computer vision. The model could allow computationally demanding vision applications, such as autonomous driving or medical image segmentation, to run on edge devices.

Semantic segmentation

A self-driving vehicle must quickly and accurately identify the objects it encounters, such as a delivery truck idling at the corner or a cyclist speeding toward an approaching intersection. The vehicle might use a sophisticated computer vision model to categorize every pixel in a high-resolution image of the scene, so it doesn't overlook objects that a lower-resolution image would miss. However, this task, known as semantic segmentation, is complex and requires massive computation when the image resolution is high.

Computer vision model

Researchers have created a computer vision model that dramatically reduces the computational complexity of this task. Their model can conduct semantic segmentation accurately in real-time on a device with limited hardware resources, such as the onboard computers that allow an autonomous vehicle to make split-second decisions.

Recent state-of-the-art semantic segmentation models directly learn how each pair of pixels in an image interacts. This means their computation grows quadratically as the image's resolution increases. Although these models are accurate, they are too slow to process high-resolution images in real time on a sensor or cell phone.

High-resolution computer vision models

The researchers developed a new building block for semantic segmentation models that has the same capabilities as these state-of-the-art models but is far more computationally and hardware efficient. The result is a new family of high-resolution computer vision models that runs up to nine times faster on mobile devices than prior models. Notably, the new model series matched or exceeded the accuracy of these alternatives.

This approach could improve high-resolution computer vision tasks such as medical image segmentation and enable self-driving cars to make real-time judgments.

Vision transformers

For a machine-learning model, classifying every pixel in a high-resolution image with millions of pixels is challenging. Vision transformers, a powerful new model type, have recently been applied to the task with great success. Transformers were originally designed for natural language processing. In that context, they encode every word in a sentence as a token and then generate an attention map that captures the relationships between each token and every other token. This attention map helps the model understand context when making predictions.
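The attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not the researchers' implementation; for simplicity, the queries, keys, and values are the token embeddings themselves, whereas a real transformer first applies learned linear projections.

```python
import numpy as np

def attention(tokens, d_k):
    """Scaled dot-product attention over a sequence of token embeddings.

    Returns the N x N attention map (one weight per token pair) and the
    resulting context-aware output vectors.
    """
    scores = tokens @ tokens.T / np.sqrt(d_k)          # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    attn_map = np.exp(scores)
    attn_map /= attn_map.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    output = attn_map @ tokens                         # weighted mix of tokens
    return attn_map, output

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))   # 6 tokens, 8-dimensional embeddings
attn_map, out = attention(tokens, d_k=8)
print(attn_map.shape, out.shape)       # (6, 6) (6, 8)
```

Each row of the attention map says how strongly one token attends to every other token, which is exactly why the map grows quadratically with the number of tokens.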

Similarity function

Before constructing an attention map, a vision transformer divides an image into patches of pixels and encodes each patch as a token. To generate the attention map, the model employs a similarity function that directly learns the interaction between each pair of pixels. In this manner, the model develops what is known as a global receptive field, which gives it access to all of the image's pertinent details. Since a high-resolution image may contain millions of pixels divided into thousands of patches, the attention map rapidly becomes enormous. Consequently, the amount of computation increases quadratically as image resolution increases.
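The quadratic blow-up is easy to see with a little arithmetic. The sketch below assumes a 16x16-pixel patch size (a common choice, not a figure stated in the article) and counts the tokens and attention-map entries for square images at several resolutions:

```python
def attention_map_size(height, width, patch=16):
    """Number of tokens and attention-map entries for a patchified image."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens   # the map stores one weight per token pair

for res in (256, 512, 1024, 2048):
    n, entries = attention_map_size(res, res)
    print(f"{res}x{res}: {n:>6} tokens, {entries:>12,} attention entries")
```

Doubling the resolution quadruples the token count and multiplies the attention map by sixteen: a 2048x2048 image already needs over 268 million attention entries per map, per layer.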

EfficientViT model

In their new EfficientViT model series, the researchers used a simplified mechanism to construct the attention map, replacing the nonlinear similarity function with a linear one. As a result, they can rearrange the order of operations to reduce the total number of calculations without sacrificing functionality or the global receptive field. With their model, the computation required to make a prediction increases linearly as image resolution increases.
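The reordering trick can be sketched as follows. This is a generic linear-attention example using a ReLU feature map (a common choice in the literature; the exact EfficientViT formulation may differ). Because the similarity function is linear, matrix multiplication is associative, so the small d x d product K'ᵀV can be computed first and the N x N map is never materialized:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention: cost grows as O(N * d^2) in the token count N,
    instead of the O(N^2 * d) of softmax attention."""
    Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)   # linear (ReLU) similarity
    kv = Kp.T @ V                                 # d x d summary, O(N * d^2)
    z = Kp.sum(axis=0)                            # per-query normalizer
    return (Qp @ kv) / (Qp @ z + eps)[:, None]    # no N x N attention map

rng = np.random.default_rng(0)
N, d = 1024, 32                                   # 1024 tokens, 32-dim heads
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                                  # (1024, 32)
```

The output still mixes information from every token, so the global receptive field is preserved, but the per-prediction cost now scales linearly with resolution rather than quadratically.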

Conclusion

The researchers gave EfficientViT a hardware-friendly architecture so it can run efficiently on a range of devices, such as virtual reality headsets or the edge computers on autonomous vehicles. Their model could also be applied to other computer vision tasks, such as image classification. When they tested the model on datasets used for semantic segmentation, they found that it ran up to nine times faster on an Nvidia graphics processing unit (GPU) than other widely used vision transformer models, with the same or better accuracy.

Building on these results, the researchers want to apply this method to speed up generative machine-learning models, such as those used to create new images. They also plan to scale up EfficientViT for additional vision tasks.

Sources of Article

Image source: Unsplash
