In 2021, computer vision went mainstream. Thanks to recent advances in artificial intelligence (AI) and deep learning, it has become a powerful tool for driving industry transformation. Computer vision is also critical to augmented and virtual reality, the technologies that let computing devices such as smartphones, tablets, and smart glasses overlay and embed virtual objects in real-world imagery.
The following are the year's top ten most interesting research papers in computer vision. In a nutshell, it's a curated list of the most significant advances in AI and CV, each accompanied by a concise video explanation, a link to a more detailed article, and code.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
This paper introduces a novel vision Transformer, the Swin Transformer. The researchers propose a hierarchical Transformer whose representation is computed with shifted windows (hence the name Swin). This hierarchical architecture has the flexibility to model at various scales, and its computational complexity is linear in the image size. The hierarchical design and the shifted-window approach also prove beneficial for all-MLP architectures.
Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Code: Click here for the code
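To make the shifted-window idea concrete, here is a minimal PyTorch sketch of window partitioning and the cyclic shift applied between consecutive layers. The window size of 7 follows the paper, but the helper function and tensor shapes are illustrative assumptions, not the official implementation.

```python
import torch

def window_partition(x, window_size=7):
    """Split a (B, H, W, C) feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, window_size * window_size, C): attention runs inside
    # each window, so the cost grows linearly with image size, not quadratically.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 96)                          # stage-1 feature map (assumed shape)
windows = window_partition(x)                           # regular windows for layer l
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))   # cyclic shift for layer l + 1
shifted_windows = window_partition(shifted)
print(windows.shape, shifted_windows.shape)             # (64, 49, 96) each
```

Shifting the window grid between consecutive layers is what lets information flow across window boundaries while keeping attention local and cheap.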
Total Relighting: Learning to Relight Portraits for Background Replacement
The researchers propose a novel system for portrait relighting and background replacement. It preserves high-frequency boundary details and accurately models the subject's appearance under the new illumination, producing realistic composite images for any desired scene. The technique uses alpha matting to extract the foreground, then relights it and composites it onto the new background.
Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
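The final compositing step described above can be sketched in a few lines: given an estimated alpha matte, a relit foreground, and the target background, blend them per pixel. The arrays below are random placeholders; in the actual system the matte and the relit foreground are predicted by learned networks.

```python
import numpy as np

def composite(relit_foreground, new_background, alpha):
    """Alpha-blend a relit subject onto the replacement background."""
    alpha = alpha[..., None]                      # (H, W) -> (H, W, 1)
    return alpha * relit_foreground + (1.0 - alpha) * new_background

fg = np.random.rand(256, 256, 3)      # relit portrait (placeholder)
bg = np.random.rand(256, 256, 3)      # desired scene (placeholder)
matte = np.random.rand(256, 256)      # per-pixel foreground opacity (placeholder)
result = composite(fg, bg, matte)
```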
Zero-Shot Text-to-Image Generation
Historically, text-to-image generation has focused on developing better modelling assumptions for training on a fixed dataset. These assumptions may involve complex architectures, auxiliary losses, or additional information supplied during training, such as object part labels or segmentation masks. The researchers describe a straightforward approach to this problem based on a transformer that autoregressively models text and image tokens as a single stream of data. Given sufficient data and scale, their approach outperforms previous domain-specific models in a zero-shot evaluation.
Paper: Zero-Shot Text-to-Image Generation
Code: Click here for code & more information
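The "single stream" idea can be illustrated with a minimal sketch: text tokens and discrete image tokens (e.g., produced by a learned image tokenizer such as a discrete VAE) are concatenated into one sequence and modelled autoregressively with a causal mask. The vocabulary sizes, sequence lengths, and the tiny transformer below are stand-in assumptions, not the paper's actual components.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 256))                   # encoded caption
image_tokens = torch.randint(0, IMAGE_VOCAB, (1, 1024)) + TEXT_VOCAB   # 32x32 grid of image codes
stream = torch.cat([text_tokens, image_tokens], dim=1)                 # one joint data stream

embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)

seq_len = stream.size(1)
# Causal mask: each position may only attend to earlier tokens.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
out = model(embed(stream), mask=causal_mask)     # next-token prediction over the whole stream
print(out.shape)                                 # torch.Size([1, 1280, 512])
```

Because image tokens simply continue the same sequence as the text tokens, generating an image is nothing more than sampling the next tokens after the caption.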
Taming Transformers for High-Resolution Image Synthesis
Transformers learn long-range interactions on sequential data and continue to achieve state-of-the-art results on a wide variety of tasks. Unlike convolutional neural networks (CNNs), however, they contain no inductive bias that favours local interactions. The researchers demonstrate how combining the effectiveness of the CNNs' inductive bias with the expressivity of transformers allows them to model, and thus synthesise, high-resolution images. In particular, they present the first results on semantically guided synthesis of megapixel images with transformers.
Paper: Taming Transformers for High-Resolution Image Synthesis
Code: Taming Transformers
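A minimal sketch of the two-stage idea: a CNN compresses the image into a grid of features, each feature is snapped to its nearest codebook entry, and a transformer then models the resulting sequence of code indices. The tiny encoder, codebook size, and shapes below are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # CNN stage: local inductive bias
    nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
    nn.Conv2d(64, 256, 4, stride=4),
)
codebook = nn.Embedding(1024, 256)            # learned discrete vocabulary (placeholder size)

image = torch.randn(1, 3, 256, 256)
z = encoder(image)                            # (1, 256, 16, 16)
z_flat = z.permute(0, 2, 3, 1).reshape(-1, 256)
# Vector quantisation: index of the nearest codebook vector for each position.
distances = torch.cdist(z_flat, codebook.weight)
indices = distances.argmin(dim=1).view(1, 16 * 16)
# These 256 indices form the short sequence a transformer models autoregressively
# to synthesise new high-resolution images.
print(indices.shape)                          # torch.Size([1, 256])
```

The quantised grid is orders of magnitude shorter than the raw pixel sequence, which is what makes transformer modelling tractable at megapixel resolutions.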
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
The researchers tackle perpetual view generation: the long-range generation of novel views corresponding to an arbitrarily long camera trajectory, given a single image. They also introduce a dataset of aerial footage of coastal scenes. Their method can generate plausible scenes over much longer time horizons and much larger camera trajectories than existing methods.
Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
Code: Click here for the code
Demo: Colab demo
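The overall loop can be sketched as an iterative render-and-refine procedure: starting from one image, repeatedly warp it to the next camera pose and let a network in-paint and sharpen the result. The functions `warp_to_pose` and `refinement_net` are hypothetical placeholders standing in for the geometric renderer and the learned refinement model, not the authors' code.

```python
def perpetual_view_generation(image, depth, camera_trajectory, warp_to_pose, refinement_net):
    """Generate one frame per camera pose along an arbitrarily long trajectory."""
    frames = [image]
    for pose in camera_trajectory:
        # Geometric step: re-render the current view from the next camera pose.
        warped_image, warped_depth = warp_to_pose(image, depth, pose)
        # Generative step: fill disocclusions and add detail so that errors
        # do not accumulate as the camera flies ever further into the scene.
        image, depth = refinement_net(warped_image, warped_depth)
        frames.append(image)
    return frames
```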
GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
The researchers developed a modified generative adversarial network (GAN) architecture that can move objects within an image without affecting the background or the other objects. It disentangles one or more objects from the background, as well as the shapes and appearances of individual objects, while learning from unstructured and unposed image collections without additional supervision. Combining this compositional scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model.
Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
Code: Click here for the code
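The compositional idea can be sketched as follows: each object (and the background) has its own generative feature field, the fields are combined by density-weighted averaging of their features, and a neural rendering pipeline turns the composited features into pixels. The shapes, feature sizes, and random inputs below are illustrative assumptions.

```python
import torch

def composite_fields(densities, features):
    """densities: (N_obj, P), features: (N_obj, P, C) for P sample points."""
    total_density = densities.sum(dim=0)                         # summed volume density
    weights = densities / total_density.clamp(min=1e-8)          # each object's share per point
    combined = (weights.unsqueeze(-1) * features).sum(dim=0)     # (P, C) composited features
    return total_density, combined

# Two objects plus a background field, evaluated at 4096 sample points with 128-d features.
dens = torch.rand(3, 4096)
feat = torch.randn(3, 4096, 128)
sigma, feats = composite_fields(dens, feat)
# `sigma` and `feats` would then be volume-rendered into a low-resolution feature
# image and upsampled to the final pixels by a 2D neural rendering network.
```

Because each object owns its own field, its pose, shape, and appearance codes can be edited independently before composition, which is what enables moving objects without disturbing the rest of the scene.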
TimeLens: Event-based Video Frame Interpolation
TimeLens uses the events recorded between the frames of a video to reconstruct the motion that occurred at speeds our eyes cannot perceive. It achieves results that our smartphones and previous models could not match.
Paper: TimeLens: Event-based Video Frame Interpolation
Code: Click here for the code
Animating Pictures with Eulerian Motion Fields
The researchers demonstrate an automated method for transforming a static image into a realistic animated looping video. Moreover, the researchers demonstrate the method's effectiveness and robustness by examining a diverse set of examples, including beaches, waterfalls, and flowing rivers.
Paper: Animating Pictures with Eulerian Motion Fields
Code: Click here for the code
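The "Eulerian" part of the method can be sketched simply: a single, time-invariant 2D motion field is integrated forward, so a pixel's position at frame t is obtained by repeatedly looking up the same flow at its current location. The random flow field and nearest-pixel lookup below are illustrative assumptions; the paper predicts the field with a network and renders frames via feature splatting.

```python
import numpy as np

def integrate_motion(flow, positions, num_steps):
    """flow: (H, W, 2) static motion field; positions: (N, 2) pixel coordinates (x, y)."""
    H, W, _ = flow.shape
    trajectory = [positions.copy()]
    for _ in range(num_steps):
        xs = np.clip(positions[:, 0].astype(int), 0, W - 1)
        ys = np.clip(positions[:, 1].astype(int), 0, H - 1)
        positions = positions + flow[ys, xs]      # Euler step using the same field every frame
        trajectory.append(positions.copy())
    return trajectory

flow = np.random.uniform(-1, 1, size=(256, 256, 2))                       # placeholder motion field
pts = np.stack(np.meshgrid(np.arange(256), np.arange(256)), axis=-1)
pts = pts.reshape(-1, 2).astype(float)                                     # one trajectory per pixel
paths = integrate_motion(flow, pts, num_steps=10)
```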
Conclusion
Computer vision is an exciting field of study. Tasks that once took days, weeks, or even months of computation can now be completed quickly, and when combined with cloud computing, these systems run at lightning speed.