Progress in computer vision has accelerated in recent years, driven primarily by the abundance of visual data, increasingly powerful computing hardware, and advances in deep learning techniques.

Computer vision has enabled numerous applications once regarded as speculative, including facial recognition, autonomous vehicles, robotic automation, and medical anomaly detection.

Since Stable Diffusion, DALL·E, and Midjourney arrived on the Internet, text-to-image and text-to-video models have grown rapidly. As we enter 2024, there are no indications that this growth will decelerate.

Let us examine some of the most captivating computer vision models introduced in this rapidly evolving landscape.

LEGO

ByteDance and Fudan University have jointly developed LEGO, a multimodal grounding model that captures detailed local information and can precisely identify and localise objects in images and videos.

The model is trained on data spanning multiple modalities and levels of granularity, which improves performance on tasks that demand a fine-grained understanding of the input. To address the shortage of suitable training data, the team also built a comprehensive dataset covering several grounding formats. The model, code, and dataset have been released as open source to encourage progress in the field.
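
For intuition only, the sketch below shows one way a multi-grained grounding sample could be represented: a coarse, media-level caption alongside fine-grained phrases tied to bounding boxes (and timestamps for video). The field names are illustrative assumptions for exposition, not LEGO's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class GroundedRegion:
    # Fine-grained level: a local phrase tied to a spatial (and optional temporal) location.
    phrase: str                                   # e.g. "a red backpack"
    box_xyxy: Tuple[float, float, float, float]   # normalized [x1, y1, x2, y2]
    timestamp_s: Optional[float] = None           # set for video frames, None for images

@dataclass
class GroundedSample:
    # Hypothetical record layout, not LEGO's real schema.
    media_path: str
    global_caption: str                           # coarse, media-level description
    regions: List[GroundedRegion] = field(default_factory=list)

sample = GroundedSample(
    media_path="demo.mp4",
    global_caption="A hiker walks along a forest trail.",
    regions=[GroundedRegion("a red backpack", (0.42, 0.31, 0.58, 0.55), timestamp_s=3.2)],
)
print(sample.regions[0].phrase, sample.regions[0].box_xyxy)
```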

Motionshop

Alibaba has unveiled Motionshop, a framework for replacing characters in a video with 3D avatars. It consists of two primary components: a video processing pipeline that extracts the background, and a pose estimation and rendering pipeline that generates the avatars. The procedure is accelerated through parallelization and a high-performance ray-tracing renderer (TIDE), allowing completion within minutes.

The framework relies on pose estimation, animation retargeting, and light estimation to integrate the 3D models seamlessly, and TIDE is used throughout the rendering step to add photorealistic detail. The final video is produced by compositing the rendered avatar over the original footage.
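
As a rough illustration of that structure, the sketch below runs two placeholder stages, background extraction and pose estimation plus rendering, in parallel and composites their outputs. The stage functions are stand-ins; the real system uses dedicated detection, tracking, inpainting, and retargeting models plus the TIDE renderer, none of which are reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_background(video_path: str) -> str:
    # Stand-in for removing the tracked character and inpainting the background plate.
    return f"{video_path}.background"

def estimate_pose_and_render(video_path: str, avatar: str) -> str:
    # Stand-in for per-frame pose estimation, animation retargeting, and avatar rendering.
    return f"{video_path}.{avatar}.rendered"

def composite(background: str, rendered: str) -> str:
    # Stand-in for blending rendered avatar frames over the cleaned background.
    return f"composite({background}, {rendered})"

def motionshop_like(video_path: str, avatar: str) -> str:
    # The two pipelines are independent, so they can run in parallel,
    # which is part of why the end-to-end process finishes within minutes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        bg = pool.submit(extract_background, video_path)
        rendered = pool.submit(estimate_pose_and_render, video_path, avatar)
        return composite(bg.result(), rendered.result())

print(motionshop_like("input.mp4", "robot_avatar"))
```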

Zero-shot Identity-Preserving Generation in Seconds

Although techniques such as Textual Inversion, DreamBooth, and LoRA have advanced personalized image synthesis significantly, their practical use is typically hindered by large storage requirements, time-consuming fine-tuning, and dependence on multiple reference images. Existing ID embedding-based approaches, meanwhile, face their own difficulties: they may require substantial fine-tuning, lack compatibility with pre-trained community models, or degrade face fidelity.

InstantID, a diffusion-based module, was developed to address these problems. It handles identity-preserving image customisation in many styles from a single facial image while maintaining high fidelity and quality. At its core, InstantID introduces IdentityNet, which combines strong semantic and weak spatial conditions to guide image synthesis.

It integrates seamlessly with popular text-to-image models such as SD1.5 and SDXL as a flexible plugin. The approach performs exceptionally well at generating content while preserving the subject's identity, making it highly valuable in practice. Although issues remain, such as entangled facial attributes and the possibility of bias, the system is robust, compatible, and efficient.
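
A hedged sketch of the single-image workflow described above: extract an identity embedding from one reference face, build an IdentityNet-style condition that pairs it with weak spatial cues (facial landmarks), and hand both to a frozen text-to-image backbone. Every function below is a placeholder invented for illustration; none of them are InstantID's published API.

```python
import numpy as np

def extract_id_embedding(face_image: np.ndarray) -> np.ndarray:
    # Placeholder: a face-recognition backbone would produce this embedding.
    return np.random.default_rng(0).standard_normal(512)

def identitynet_condition(id_embedding: np.ndarray, landmarks: np.ndarray) -> dict:
    # Placeholder: combine the strong semantic condition (identity embedding)
    # with the weak spatial condition (facial landmarks).
    return {"identity": id_embedding, "spatial": landmarks}

def generate(prompt: str, condition: dict) -> str:
    # Placeholder: a frozen diffusion backbone (e.g. SD1.5 or SDXL) would
    # consume the text prompt plus the identity condition here.
    return f"image for '{prompt}' conditioned on identity {condition['identity'][:3].round(2)}"

face = np.zeros((512, 512, 3), dtype=np.uint8)   # one reference photo
landmarks = np.zeros((5, 2))                     # five facial keypoints
condition = identitynet_condition(extract_id_embedding(face), landmarks)
print(generate("a watercolor portrait in a garden", condition))
```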

Autoregressive Image Models

Apple has unveiled AIM, a collection of autoregressive image models that draw inspiration from large language models (LLMs) and scale comparably to their textual counterparts. The main results suggest that the quality of the learned visual features improves as model capacity and data volume increase, and that the value of the pre-training objective correlates with downstream task performance. A seven-billion-parameter AIM pre-trained on two billion images reaches 84.0% accuracy on ImageNet-1k with the model's trunk kept frozen.

AIM, pre-trained in a manner comparable to LLMs, offers a scalable approach that does not rely on image-specific strategies. It represents a promising new direction for training large-scale vision models, with attractive properties and a strong link between pre-training and downstream performance. The absence of saturation in the scaling curves suggests that further gains are possible by training larger models for longer.
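
Conceptually, the objective AIM scales is next-token prediction transferred to image patches: an image is turned into an ordered sequence of patches and the model regresses each patch from the ones before it. The toy sketch below shows only that framing, patchification plus a shifted regression loss, and is not Apple's training code; the array shapes and patch size are assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def next_patch_regression_loss(predictions: np.ndarray, patches: np.ndarray) -> float:
    """Mean squared error of predicting patch t+1 from the model's output at step t."""
    targets = patches[1:]                     # shift by one, as in next-token prediction
    return float(np.mean((predictions[:-1] - targets) ** 2))

image = np.random.default_rng(0).random((224, 224, 3)).astype(np.float32)
patches = patchify(image)                     # (196, 768) for 16x16 patches on a 224x224 image
dummy_predictions = np.zeros_like(patches)    # stands in for the model's output sequence
print(patches.shape, next_patch_regression_loss(dummy_predictions, patches))
```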

Parrot: Text-to-Image Generation

A team of researchers from Google Research, Google DeepMind, OpenAI, Rutgers University, and Korea University has developed Parrot, a new reinforcement learning (RL) framework whose goal is to optimize multiple quality rewards simultaneously to improve text-to-image (T2I) generation.

Parrot uses batch-wise Pareto-optimal selection to avoid over-optimizing any single reward and to sidestep manual reward weighting. It co-trains the T2I model with a prompt expansion network, which produces quality-aware text prompts, and applies original-prompt-centred guidance at inference time so that the user's original prompt is not forgotten. Experiments and a user study show that Parrot outperforms baseline methods on several quality metrics, including text-image alignment, human preference, aesthetics, and image sentiment.
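
Parrot's selection step can be illustrated in isolation: given a batch of generated samples, each scored on several rewards, keep only the samples that no other sample dominates across every reward at once. The sketch below implements that batch-wise Pareto-front rule with made-up reward values; it is not Parrot's training loop, and the reward names are only examples drawn from the metrics mentioned above.

```python
import numpy as np

def pareto_front(rewards: np.ndarray) -> np.ndarray:
    """Return indices of non-dominated rows in a (batch, n_rewards) array.

    A sample is dominated if another sample is at least as good on every
    reward and strictly better on at least one.
    """
    keep = []
    for i, row in enumerate(rewards):
        dominated = np.any(
            np.all(rewards >= row, axis=1) & np.any(rewards > row, axis=1)
        )
        if not dominated:
            keep.append(i)
    return np.array(keep)

# Four generated images scored on [text-image alignment, aesthetics, image sentiment].
batch_rewards = np.array([
    [0.9, 0.2, 0.5],   # non-dominated: best alignment
    [0.8, 0.8, 0.6],   # non-dominated: good on all three
    [0.7, 0.7, 0.5],   # dominated by the sample above
    [0.3, 0.9, 0.9],   # non-dominated: best aesthetics and sentiment
])
print(pareto_front(batch_rewards))   # -> [0 1 3]
```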
