A machine-learning approach for high-resolution computer vision could enable computationally intensive vision applications like autonomous driving and medical image segmentation on edge devices. 

A self-driving vehicle must quickly and accurately identify the objects it encounters, such as a delivery truck idling at a corner or a cyclist speeding toward an approaching intersection.

The vehicle may use a sophisticated computer vision model to categorize each pixel in a high-resolution image of the scene, so that it does not miss objects a lower-resolution image would obscure. However, this task, known as semantic segmentation, is complex and demands massive computation when the image resolution is high.
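
To make the task concrete, here is a minimal sketch (in NumPy, with a hypothetical class palette) of what a segmentation model produces: one class label for every pixel rather than one label for the whole image.

```python
import numpy as np

# Hypothetical class palette for a driving scene.
CLASSES = ["road", "vehicle", "cyclist", "pedestrian", "background"]

def segment(image: np.ndarray) -> np.ndarray:
    """Toy stand-in for a segmentation network: maps an (H, W, 3)
    image to per-pixel class scores, then picks the most likely
    class for every pixel."""
    h, w, _ = image.shape
    # A real model computes these scores with a deep network;
    # random values here just demonstrate the shapes involved.
    scores = np.random.rand(h, w, len(CLASSES))
    return scores.argmax(axis=-1)             # shape (H, W): one label per pixel

labels = segment(np.zeros((1024, 2048, 3)))   # a 2-megapixel scene
print(labels.shape)                           # (1024, 2048): ~2 million labels
```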

Computer vision model

Researchers have devised a more efficient computer vision model that drastically reduces this task's computational complexity. Their model can perform semantic segmentation accurately in real time on a device with limited hardware resources, such as the onboard computers that allow an autonomous vehicle to make split-second decisions.

The computations of modern state-of-the-art semantic segmentation models grow quadratically as image resolution increases. While these models are accurate, they are therefore too slow to run in real time on an edge device such as a sensor or mobile phone.
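
A rough back-of-the-envelope calculation (the patch size and token counts below are illustrative, not taken from the paper) shows why quadratic scaling is prohibitive at high resolution:

```python
# With standard (softmax) attention, cost scales with the square of
# the token count. Doubling each image dimension quadruples the
# number of tokens and multiplies the attention cost by ~16.
for side in (256, 512, 1024, 2048):
    tokens = (side // 16) ** 2   # assuming 16x16-pixel patches
    print(f"{side}x{side} image: {tokens:>6} tokens, "
          f"~{tokens ** 2:>12,} pairwise interactions")
```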

Linear computational complexity

The researchers developed a novel building block for semantic segmentation models that achieves the same capabilities as these state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.

The result is a new family of high-resolution computer vision models that, when deployed on a mobile device, runs up to nine times faster than previous models with equal or better accuracy. Beyond helping autonomous vehicles make decisions in real time, this method could improve other high-resolution computer vision applications, such as medical image segmentation.

Vision transformer

Classifying the millions of individual pixels in a high-resolution image is a formidable challenge for machine-learning models. Recently, a powerful new kind of model called a vision transformer has been applied effectively to this task.

Transformers were originally designed for natural language processing (NLP). In that setting, they treat each word in a sentence as a token and build an attention map that captures how every token relates to every other token. The information in the attention map helps the model make accurate predictions.
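
For illustration, here is a minimal sketch of standard scaled dot-product attention, using random placeholder embeddings rather than learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_tokens, dim = 8, 64               # e.g. 8 words in a sentence
Q = np.random.randn(n_tokens, dim)  # query vector per token
K = np.random.randn(n_tokens, dim)  # key vector per token

# The attention map: one row per token, giving the weight it places
# on every other token (rows sum to 1).
attention = softmax(Q @ K.T / np.sqrt(dim))
print(attention.shape)              # (8, 8): all pairwise connections
```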

Attention map

Vision transformers slice an image into small patches of pixels and treat each patch as a token, producing an attention map over those tokens. To generate the map, the model uses a similarity function that learns the direct interaction between every pair of pixels. This gives the model a global receptive field, meaning it has access to all the relevant details in the image.
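
Here is a sketch of the patching step, assuming 16-by-16-pixel patches (a common choice, though the exact size varies by model):

```python
import numpy as np

def image_to_tokens(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Slice an (H, W, C) image into non-overlapping patches and
    flatten each patch into one token vector."""
    h, w, c = image.shape
    image = image[: h - h % patch, : w - w % patch]  # crop to multiples of patch
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    patches = image.reshape(rows, patch, cols, patch, c).swapaxes(1, 2)
    return patches.reshape(rows * cols, patch * patch * c)

tokens = image_to_tokens(np.zeros((1024, 2048, 3)))
print(tokens.shape)   # (8192, 768): 8,192 tokens for a 2-megapixel image
```

At this scale, a full attention map would hold 8,192² ≈ 67 million entries, which is exactly the cost that grows quadratically with resolution.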

EfficientViT model

Because a high-resolution image can contain millions of pixels, the attention map quickly becomes enormous, and the computation required to process the image climbs quadratically as the resolution increases. The new EfficientViT model series from MIT's researchers constructs the attention map more cheaply by switching from a nonlinear similarity function to a linear one. This lets the researchers reorder the flow of operations to reduce the required calculations without sacrificing functionality or the global receptive field; as a result, the model's computation grows only linearly as the image resolution increases.
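
To illustrate the reordering, here is a minimal sketch of linear attention using ReLU feature maps as the similarity function. This follows the general linear-attention recipe; details such as normalization and the multi-scale aggregation in the actual EfficientViT model differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linear_attention(Q, K, V):
    """Linear attention with ReLU feature maps. Because the
    similarity is a plain dot product (no softmax), the matrix
    products can be regrouped as Q' @ (K'^T @ V), so the N x N
    attention map is never materialized. Cost is O(N * d^2)
    rather than O(N^2 * d)."""
    Qp, Kp = relu(Q), relu(K)     # (N, d) feature maps
    kv = Kp.T @ V                 # (d, d): size independent of N
    z = Kp.sum(axis=0)            # (d,): accumulated normalizer
    out = Qp @ kv                 # (N, d)
    denom = Qp @ z + 1e-6         # (N,): per-token normalization
    return out / denom[:, None]

N, d = 8192, 64                   # 8,192 tokens, no 8192 x 8192 map needed
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (8192, 64)
```

Because K'^T @ V is a small d-by-d matrix, it can be computed once and reused for every query token, which is what makes the cost linear in the number of tokens.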

Conclusion

On semantic segmentation datasets, their model was up to nine times faster on an Nvidia GPU than other prominent vision transformer models while maintaining or improving accuracy.

Looking ahead, the researchers hope to apply this method to speed up generative machine-learning models, such as those used to create new images. They also hope to expand the use of EfficientViT to larger-scale vision tasks.
