A machine-learning approach for high-resolution computer vision could enable computationally intensive vision applications like autonomous driving and medical image segmentation on edge devices.
A self-driving vehicle must quickly and accurately identify the objects it encounters, such as a delivery truck idling at the corner or a cyclist speeding toward an approaching intersection.
The vehicle might use a sophisticated computer vision model to categorize every pixel in a high-resolution image of the scene, so it doesn't lose sight of objects that a lower-resolution image could obscure. But this task, known as semantic segmentation, is complex and requires a huge amount of computation when the image resolution is high.
Researchers have devised a more efficient computer vision model that drastically reduces the computational complexity of this task. Their model can perform semantic segmentation accurately in real time on a device with limited hardware resources, such as the onboard computers that allow an autonomous vehicle to make split-second decisions.
Modern state-of-the-art semantic segmentation models directly learn the interaction between every pair of pixels, so their calculations grow quadratically as image resolution increases. While these models are accurate, they are too slow to run in real time on an edge device such as a sensor or cell phone.
The researchers developed a new building block for semantic segmentation models that offers the same capabilities as these state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.
The result is a new family of high-resolution computer vision models that runs up to nine times faster on mobile devices than previous models, while matching or exceeding their accuracy. Beyond helping autonomous vehicles make decisions in real time, this method could improve the efficiency of other high-resolution computer vision applications, such as medical image segmentation.
Classifying the millions of individual pixels in a high-resolution image is a formidable challenge for a machine-learning model. Recently, a powerful new kind of model, called a vision transformer, has been applied effectively to this task.
Transformers were originally developed for natural language processing (NLP). In that setting, they treat each word in a sentence as a token and generate an attention map that captures each token's relationship with every other token. The attention map helps the model understand context when it makes predictions.
Vision transformers work the same way: they slice an image into small patches of pixels, treat each patch as a token, and produce an attention map. To generate this map, the model uses a similarity function that directly learns the interaction between every pair of pixels. This creates a global receptive field, giving the model access to all the essential details in the image.
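To make the mechanism concrete, here is a minimal NumPy sketch of standard softmax attention over patch tokens. The patch size, embedding dimension, and function names are illustrative choices, not details taken from the research described here; the point is that the N × N attention map couples every token with every other token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Softmax self-attention over N patch tokens.

    Q, K, V: (N, d) arrays. The explicit (N, N) attention map is what
    makes the cost grow quadratically with the number of tokens, and
    hence with image resolution.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (N, N) pairwise similarities
    attn_map = softmax(scores, axis=-1)  # each row sums to 1
    return attn_map @ V                  # (N, d) aggregated features

# Example: a 512x512 image cut into 16x16 patches -> 1,024 tokens
N, d = (512 // 16) ** 2, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(standard_attention(Q, K, V).shape)  # (1024, 64)
```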
Because a high-resolution image can contain millions of pixels, the attention map quickly becomes enormous, and the computation required to process the image grows at a quadratic rate as resolution increases. In their new model series, called EfficientViT, the MIT researchers simplified how the attention map is built by switching from a nonlinear similarity function to a linear one. That change lets them reorder the operations to reduce the total number of calculations without sacrificing functionality or losing the global receptive field. With their model, the computation needed for a prediction grows only linearly as the image resolution increases.
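The reordering trick can be sketched generically as follows. This is a standard linear-attention example in NumPy, not the exact EfficientViT module; the ReLU feature map is an illustrative stand-in for the linear similarity function described in the article. Because the similarity factorizes, the (d, d) product K^T V can be computed first and the N × N map is never materialized, so the cost scales linearly with the number of tokens.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Attention with a factorized similarity and reordered operations.

    Q, K, V: (N, d) arrays. A nonnegative feature map phi (ReLU here,
    as an illustrative choice) replaces the softmax, so that
    sim(q, k) = phi(q) . phi(k) and the matrix products can be
    regrouped to avoid the (N, N) attention map entirely.
    """
    phi_Q = np.maximum(Q, 0.0)           # (N, d) feature-mapped queries
    phi_K = np.maximum(K, 0.0)           # (N, d) feature-mapped keys

    KV = phi_K.T @ V                     # (d, d): cost grows linearly in N
    Z = phi_K.sum(axis=0)                # (d,): accumulator for normalization

    numer = phi_Q @ KV                   # (N, d)
    denom = phi_Q @ Z + eps              # (N,): per-token normalizer
    return numer / denom[:, None]

# Doubling the number of tokens roughly doubles the work,
# instead of quadrupling it as in softmax attention.
rng = np.random.default_rng(0)
for N in (1_024, 2_048):
    Q, K, V = (rng.standard_normal((N, 64)) for _ in range(3))
    print(N, linear_attention(Q, K, V).shape)
```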
On semantic segmentation datasets, their model was up to nine times faster on an Nvidia GPU than other prominent vision transformer models while maintaining or improving accuracy.
Moving forward, the researchers hope to apply this method to speed up generative machine-learning models, such as those used to create new images. They also hope to expand the use of EfficientViT to larger-scale vision tasks.
Image source: Unsplash