OpenAI, a Microsoft-affiliated AI firm, has announced an AI system that can take a description of an object or scene and automatically generate a very realistic image depicting it. 

A half-decade ago, the world's leading AI labs developed systems capable of recognizing items in digital photographs and even generating images on their own, including flowers, pets, vehicles, and faces. A few years later, they developed algorithms that could summarize articles, respond to inquiries, create tweets, and even write blog posts. Now, researchers are integrating these technologies to create new forms of AI.

What is Dall-E?

OpenAI revealed Dall-E on January 5, 2021. Dall-E is a 12-billion-parameter variant of GPT-3 trained on a dataset of text-image pairs to produce images from text descriptions. In addition, Dall-E can create anthropomorphized representations of animals and objects, combine unrelated concepts in believable ways, generate text, and apply alterations to existing images, among other things.

Dall-E works together with CLIP (Contrastive Language-Image Pre-training), a separate model tasked with "understanding and ranking" Dall-E's output. CLIP curates Dall-E's images, presenting only the highest-quality ones for any particular query.
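The ranking step can be pictured with a small sketch. It assumes that the text prompt and each candidate image have already been turned into embedding vectors, and that candidates are scored by cosine similarity to the prompt. The function names (`cosine_similarity`, `rerank`) and the toy vectors are hypothetical and for illustration only; this is not OpenAI's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(text_embedding, image_embeddings, top_k):
    """Return the indices of the top_k candidates closest to the text, best first."""
    scored = [
        (cosine_similarity(text_embedding, emb), index)
        for index, emb in enumerate(image_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:top_k]]

# Toy, hand-made embeddings (hypothetical values, not real model output).
prompt = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # unrelated to the prompt
    [0.9, 0.1, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]

best = rerank(prompt, candidates, top_k=2)
print(best)
```

In a real pipeline the embeddings would come from trained text and image encoders; only the scoring and sorting logic is shown here.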

How does it work?

Dall-E interprets natural language inputs (such as "a green leather purse shaped like a pentagon" or "an isometric depiction of a sorrowful capybara"). It can create images of both realistic and fictitious objects (such as "a stained glass window with an image of a blue strawberry" or "a cube with the texture of a porcupine"). Its name is a mashup of WALL-E and Salvador Dalí.

OpenAI has not disclosed the source code for either model. However, a "controlled demo" of Dall-E is available on OpenAI's website, where users can observe output from a limited set of example prompts. Others have developed open-source alternatives trained on smaller amounts of data, such as Dall-E Mini.

Features

OpenAI created the Generative Pre-trained Transformer (GPT) model in 2018, based on the Transformer architecture. 

  • Dall-E generates output from a description and cue using zero-shot learning, which requires no additional training.
  • Dall-E responds to requests by generating various images.
  • CLIP recognizes and ranks these images.
  • CLIP was trained on approximately 400 million image-text pairs.
  • CLIP is an image recognition system trained from Internet photos and descriptions rather than a controlled library of annotated images (such as ImageNet).
  • CLIP connects photos to full captions.
  • Researchers trained CLIP to estimate which caption (out of 32,768 possible captions) was most relevant for each image, allowing it to recognize objects in photos outside its training set.
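The caption-matching objective in the last bullet can be sketched as a classification problem over a batch: for each image, the model should give its own caption a higher score than every other caption in the batch. The sketch below is a simplified, one-directional version on a hand-made score matrix. The function names and values are hypothetical, and the real CLIP training setup differs in details (for example, it also scores captions against images and uses a learned temperature).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def caption_matching_loss(similarity):
    """Mean cross-entropy where similarity[i][j] scores image i against caption j.

    The correct caption for image i is assumed to sit at index i.
    """
    loss = 0.0
    for i, row in enumerate(similarity):
        probs = softmax(row)
        loss -= math.log(probs[i])
    return loss / len(similarity)

def predict_captions(similarity):
    """For each image, return the index of its highest-scoring caption."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]

# Toy 3x3 score matrices (hypothetical values, not real model output).
well_aligned = [
    [5.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
]
misaligned = [
    [0.0, 5.0, 0.0],  # image 0 prefers the wrong caption
    [5.0, 0.0, 0.0],  # image 1 prefers the wrong caption
    [0.0, 0.0, 5.0],
]

print(predict_captions(well_aligned))
print(caption_matching_loss(well_aligned) < caption_matching_loss(misaligned))
```

Training pushes the score matrix toward the well-aligned case, where the loss is low. The same "pick the best caption" step can then be applied to captions describing categories the model was never explicitly trained to classify.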

What is Dall-E's potential?

Dall-E can create imagery in several forms, including photorealistic images, paintings, and emojis. It can also "manipulate and reorganize" items in its images. For example, according to its inventors, when instructed to sketch a daikon radish blowing its nose, sipping a latte, or riding a unicycle, Dall-E frequently draws the handkerchief, hands, and feet in plausible placements.

Dall-E 2 overview

OpenAI has released a new version of Dall-E, its text-to-image generation tool. Dall-E 2 is a higher-resolution and lower-latency variant of the original system, generating images based on user-written descriptions. 

Aditya Ramesh recently tweeted that the team has developed a new text-to-image generation approach called unCLIP, which is the foundation of Dall-E 2. According to Ramesh, Dall-E 2 is a new AI system that can create realistic visuals and art from a natural language description.

https://twitter.com/model_mechanic/status/1511857530237964293?cxt=HHwWisCqjbyamfspAAAA

It also has additional features, such as altering an existing image. Like past OpenAI initiatives, the technology will not be available to the general public. However, researchers can sign up for a free trial of the system online, and OpenAI aims to make it available for use in third-party apps in the future.

Conclusion

Dall-E is "remarkably resilient to such alterations" and can reliably provide images for various random descriptions. CNBC's Sam Shead described the images as "quirky" and quoted Neil Lawrence, a professor of machine learning at the University of Cambridge, who called it an "inspirational proof of these models' capacity to store information about our world and generalize in ways that humans find quite natural." Similarly, Mark Riedl, an associate professor at Georgia Tech's School of Interactive Computing, says, "the Dall-E demo is remarkable for producing much more coherent illustrations than those produced by other Text2Image systems I've seen in the last few years."

However, Dall-E 2 is far from flawless. Occasionally, the system is unable to render details in complex scenes. For example, some lighting and shadow effects can be slightly off, or the borders of two objects that should be distinct can become merged. Additionally, it is less adept at comprehending "binding properties" than other multimodal AI applications.
