In 2021, computer vision went mainstream. Thanks to recent advances in artificial intelligence (AI) and deep learning, it has developed into a powerful tool for driving industry transformation. Computer vision is also critical to augmented and virtual reality, the technologies that let devices such as smartphones, tablets, and smart glasses overlay and embed virtual objects in real-world imagery.

The following are the year's top ten most interesting research papers in computer vision: in a nutshell, a curated list of the most significant advances in AI and CV, each accompanied by a concise video explanation, a link to a more detailed article, and code.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

This paper introduces a novel vision Transformer, the Swin Transformer. The researchers propose a hierarchical Transformer whose representation is computed with shifted windows. This hierarchical architecture has the flexibility to model at various scales and has computational complexity that is linear in image size. Additionally, the hierarchical design and the shifted-window approach also prove beneficial for all-MLP architectures.
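
To make the shifted-window idea concrete, here is a minimal PyTorch sketch: attention is computed only inside non-overlapping windows (which keeps cost linear in image size), and alternating layers cyclically shift the grid so that windows straddle the previous boundaries. The dimensions, head count, and layer layout are illustrative, not the paper's configuration, and the boundary masking used in the paper is omitted.

```python
# Illustrative sketch of windowed self-attention with a cyclic shift; not the full Swin model.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

class ShiftedWindowAttention(nn.Module):
    def __init__(self, dim, window_size=7, num_heads=4, shift=0):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, H, W, C)
        if self.shift:                             # cyclic shift so windows straddle old boundaries
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        B, H, W, C = x.shape
        windows = window_partition(x, self.ws)     # attention stays local -> linear cost in image size
        out, _ = self.attn(windows, windows, windows)
        out = out.reshape(B, H // self.ws, W // self.ws, self.ws, self.ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:                             # undo the shift (the paper also masks cross-boundary
            out = torch.roll(out, (self.shift, self.shift), dims=(1, 2))  # windows, omitted here)
        return out

feat = torch.randn(1, 56, 56, 96)                  # H and W divisible by the window size
layer = ShiftedWindowAttention(96, window_size=7, shift=3)
print(layer(feat).shape)                           # torch.Size([1, 56, 56, 96])
```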

Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Code: Click here for the code

Total Relighting: Learning to Relight Portraits for Background Replacement

The researchers propose a novel system for portrait relighting and background replacement. The design preserves high-frequency boundary details and accurately simulates the subject's appearance under novel lighting, producing realistic composite images for any desired scene. The technique estimates the foreground via alpha matting, relights it, and composites it onto the new background.
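
The data flow of that matting, relighting, and compositing pipeline can be sketched in a few lines. The `matting_net` and `relight_net` callables below are hypothetical stand-ins for the paper's learned modules; only the final alpha-compositing step is a standard, fixed operation.

```python
# Minimal sketch of the matting -> relighting -> compositing flow described above.
import numpy as np

def composite(portrait, background, matting_net, relight_net, target_lighting):
    """portrait, background: float arrays in [0, 1] of shape (H, W, 3)."""
    alpha, fg = matting_net(portrait)                     # alpha matte + clean foreground estimate
    relit_fg = relight_net(fg, target_lighting)           # foreground re-rendered under the new lighting
    return alpha * relit_fg + (1.0 - alpha) * background  # standard alpha compositing onto the new scene

# Dummy stand-ins, just to show the data flow end to end.
H, W = 4, 4
dummy_matting = lambda img: (np.ones((H, W, 1)) * 0.5, img)
dummy_relight = lambda fg, light: fg * light
out = composite(np.random.rand(H, W, 3), np.random.rand(H, W, 3),
                dummy_matting, dummy_relight, target_lighting=0.8)
print(out.shape)  # (4, 4, 3)
```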

Paper: Total Relighting: Learning to Relight Portraits for Background Replacement

Zero-Shot Text-to-Image Generation

Historically, text-to-image generation has focused on developing better modelling assumptions for training on a fixed dataset. These assumptions may include complex architectures, auxiliary losses, or additional information supplied during training, such as object part labels or segmentation masks. The researchers describe a straightforward approach to this problem, based on a transformer that autoregressively models the text and image tokens as a single stream of data. Given sufficient data and scale, their approach is competitive with previous domain-specific models in a zero-shot evaluation.
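
The core "single data stream" idea can be illustrated with a toy autoregressive transformer: caption tokens and discrete image tokens share one vocabulary, are concatenated into one sequence, and are trained with ordinary next-token prediction. Vocabulary sizes, the tiny model, and the omission of positional embeddings are all simplifications for illustration, not the paper's setup.

```python
# Sketch of modelling text and image tokens as one autoregressive stream.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 1000, 512, 256
VOCAB = TEXT_VOCAB + IMAGE_VOCAB                   # one shared token space

class TinyAutoregressiveTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                     # tokens: (B, T)
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                        # next-token logits over the shared vocabulary

text = torch.randint(0, TEXT_VOCAB, (2, 16))       # caption tokens
image = torch.randint(TEXT_VOCAB, VOCAB, (2, 64))  # discrete image codes (e.g. from a discrete VAE)
stream = torch.cat([text, image], dim=1)           # a single data stream
logits = TinyAutoregressiveTransformer()(stream[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), stream[:, 1:].reshape(-1))
print(loss.item())
```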

Paper: Zero-Shot Text-to-Image Generation

Code: For code & more information

Taming Transformers for High-Resolution Image Synthesis 

Transformers learn long-range interactions on sequential data and continue to set the state of the art on a wide variety of tasks. Unlike convolutional neural networks (CNNs), they contain no inductive bias that favours local interactions. The researchers demonstrate how combining the effectiveness of the CNN inductive bias with the expressivity of transformers enables them to model, and thereby synthesise, high-resolution images. In particular, they present the first results on semantically guided megapixel image synthesis with transformers.
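
A rough sketch of the first half of that combination: a small convolutional encoder (supplying the local inductive bias) compresses an image into a grid of discrete codebook indices via nearest-neighbour lookup, and that short sequence of indices is what a transformer then models autoregressively. The architecture and sizes below are illustrative stand-ins, not the paper's VQGAN.

```python
# Sketch of the two-stage idea: CNN encoder + vector quantisation, then a transformer over codes.
import torch
import torch.nn as nn

class TinyVQEncoder(nn.Module):
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # downsample 256x256 -> 16x16
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, img):                             # img: (B, 3, 256, 256)
        z = self.cnn(img)                               # (B, dim, 16, 16)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)     # (B*H*W, dim)
        dists = torch.cdist(flat, self.codebook.weight) # nearest codebook entry per location
        tokens = dists.argmin(dim=1).view(B, H * W)     # (B, 256) discrete indices
        return tokens                                   # this sequence is what the transformer models

tokens = TinyVQEncoder()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256])
```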

Paper: Taming Transformers for High-Resolution Image Synthesis

Code: Taming Transformers

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image 

The researchers introduce perpetual view generation: the long-range generation of novel views corresponding to an arbitrarily long camera trajectory, given a single image. They also present a dataset of aerial footage of coastal scenes, and their method generates plausible scenes over much longer time horizons and much larger camera trajectories than existing methods.
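
Conceptually, the generation proceeds as a render-refine-repeat loop: the current view is geometrically re-projected toward the next camera pose, a network fills in the newly revealed regions, and the result becomes the input for the next step. The sketch below only captures that loop; `warp_to_pose` and `refine_net` are hypothetical stand-ins, not the paper's components.

```python
# Sketch of a render-refine-repeat loop for perpetual view generation.
def perpetual_view_generation(image, disparity, camera_trajectory, warp_to_pose, refine_net):
    frames = [image]
    for pose in camera_trajectory:
        warped_img, warped_disp, mask = warp_to_pose(image, disparity, pose)  # geometric re-projection
        image, disparity = refine_net(warped_img, warped_disp, mask)          # fill in disoccluded regions
        frames.append(image)                                                  # output feeds the next step
    return frames
```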

Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image

Code: Click here for the code

Demo: Colab demo

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

The researchers have developed a modified generative adversarial network (GAN) architecture that can move objects within an image without affecting the background or other objects. The model disentangles individual objects from the background, as well as their shapes from their appearances, while learning from unstructured and unposed image collections without additional supervision. Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model.
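
A toy sketch of the compositional part of that representation: each object (and the background) is its own generative feature field, and their densities and features are combined at every queried 3D point before neural rendering, which is what allows one object to be edited independently of the rest. The density-weighted combination below is a simplified illustration; the toy fields and shapes are assumptions, not the paper's networks.

```python
# Sketch of composing per-object feature fields at shared 3D query points.
import torch

def compose_feature_fields(fields, points):
    """fields: callables mapping (N, 3) points -> (density (N, 1), feature (N, C))."""
    densities, features = zip(*[f(points) for f in fields])
    densities = torch.stack(densities)                 # (num_objects, N, 1)
    features = torch.stack(features)                   # (num_objects, N, C)
    total = densities.sum(dim=0)                        # overall density at each point
    weights = densities / total.clamp(min=1e-8)         # each object's contribution
    return total, (weights * features).sum(dim=0)       # composited density + feature for rendering

# Two toy "objects" with random densities/features over 5 query points.
toy_field = lambda p: (torch.rand(p.shape[0], 1), torch.rand(p.shape[0], 32))
density, feature = compose_feature_fields([toy_field, toy_field], torch.rand(5, 3))
print(density.shape, feature.shape)  # torch.Size([5, 1]) torch.Size([5, 32])
```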

Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

Code: Click here for the code

TimeLens: Event-based Video Frame Interpolation

TimeLens uses the events recorded by an event camera between the frames of a video to reconstruct motion that occurred too quickly for our eyes to perceive. It achieves results that our smartphones and previous models were incapable of achieving.
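
At a very high level, event-based interpolation takes the two keyframes plus the event stream recorded between them and estimates an intermediate frame at an arbitrary time in between. The sketch below only shows that interface; `warp_net` and `fusion_net` are hypothetical stand-ins, not TimeLens's actual modules.

```python
# Sketch of interpolating an intermediate frame from two keyframes and the events between them.
def interpolate(frame_prev, frame_next, events, t, warp_net, fusion_net):
    """t in (0, 1): temporal position of the new frame between the two keyframes."""
    from_prev = warp_net(frame_prev, events, t)        # warp the earlier keyframe toward time t
    from_next = warp_net(frame_next, events, 1.0 - t)  # warp the later keyframe backward toward t
    return fusion_net(from_prev, from_next, events, t) # fuse the candidates into the final frame
```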

Paper: TimeLens: Event-based Video Frame Interpolation

Code: Click here for the code

Animating Pictures with Eulerian Motion Fields

The researchers demonstrate an automated method for transforming a static image into a realistic, seamlessly looping video. They show the method's effectiveness and robustness on a diverse set of examples, including beaches, waterfalls, and flowing rivers.
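
The "Eulerian" part of the title refers to using a single static motion field: animation comes from repeatedly sampling that same field at each particle's current position (Euler integration). The sketch below demonstrates only that integration step; the nearest-neighbour lookup and toy field are simplifications for illustration, not the paper's warping scheme.

```python
# Sketch of Euler integration with a static (Eulerian) motion field.
import numpy as np

def integrate_positions(flow, num_steps):
    """flow: (H, W, 2) static motion field, in pixels per step."""
    H, W, _ = flow.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    pos = np.stack([xs, ys], axis=-1)                    # start every particle at its own pixel
    for _ in range(num_steps):
        xi = np.clip(np.round(pos[..., 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pos[..., 1]).astype(int), 0, H - 1)
        pos += flow[yi, xi]                              # sample the SAME field at the new location
    return pos                                           # where each source pixel ends up after N steps

flow = np.zeros((64, 64, 2)); flow[..., 0] = 0.5          # toy field: drift 0.5 px right per step
print(integrate_positions(flow, num_steps=10)[0, 0])      # [5. 0.]
```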

Paper: Animating Pictures with Eulerian Motion Fields

Code: Click here for the code

Conclusion

Computer vision is an exciting field of study. Tasks that can take even supercomputers days, weeks, or months to complete are made fast by this technology, and when combined with cloud computing, it reaches remarkable speed.
