AI is having a moment with text-to-image generation. A month ago, OpenAI announced DALL-E 2, the successor to its original DALL-E, which can produce highly realistic images from short text prompts. Text-to-image generators took a big step forward with DALL-E 2: it can render even surreal prompts in many different styles, such as a painting or a photograph. The tool pushed the limits of what people could imagine and could produce a scene in seconds. But it looks like Google AI's Imagen may take over from DALL-E 2. Both models generate images from text inputs, yet Google researchers assert that their technology delivers "exceptional photorealism and profound language comprehension."

Like OpenAI's DALL-E 2, Google's system generates pictures based on what users write. So if you ask for a vulture flying away with a laptop in its claws, you might get exactly that, made up on the spot.

What is Imagen?

Along with the announcement, Google Research's Brain Team published a paper called "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." The team tested the model against other text-to-image programs such as DALL-E 2, GLIDE, and VQ-GAN+CLIP, and Imagen outperformed its competitors by a wide margin thanks to its superior photorealism.

Images generated by Imagen. Image source: Google AI

Imagen uses a large Transformer language model (T5) to encode the input text into a numerical representation (a "text embedding"), which a diffusion model then uses to generate an image. During training, diffusion models are shown images with progressively more noise added. Once trained, the models can reverse this process and produce an image out of pure noise.
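To make the diffusion idea concrete, here is a minimal sketch of the forward noising process such a model is trained on. The step count and linear noise schedule are illustrative assumptions, not Imagen's actual hyperparameters.

```python
import torch

# Forward "noising" process a diffusion model learns to undo.
# num_steps and the linear beta schedule are illustrative values only.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)        # noise added per step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # fraction of signal kept at step t

def add_noise(image: torch.Tensor, t: int) -> torch.Tensor:
    """Corrupt a clean image with Gaussian noise at timestep t."""
    noise = torch.randn_like(image)
    a = alphas_cumprod[t]
    return a.sqrt() * image + (1.0 - a).sqrt() * noise

# Training teaches a network to predict the noise that was added.
# At generation time the process runs in reverse: starting from pure noise,
# each denoising step is conditioned on the text embedding.
```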

Imagen's images are guided by the text comprehension of a large Transformer language model. In theory, a different language model could be used as the text encoder, which would change the image quality.
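As a rough illustration of how a frozen language model can supply the text embedding, the sketch below encodes a prompt with a small T5 encoder from the Hugging Face transformers library. The tiny "t5-small" checkpoint merely stands in for the far larger T5-XXL encoder Imagen actually uses.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# "t5-small" is a stand-in for Imagen's much larger frozen T5-XXL encoder.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()

prompt = "A vulture flying away with a laptop in its claws"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Per-token embeddings the diffusion model would condition on.
    text_embedding = encoder(**tokens).last_hidden_state

print(text_embedding.shape)  # (1, num_tokens, hidden_size)
```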

Image source: Google AI

Imagen then uses AI upscaling to enlarge the low-resolution output (64 x 64 pixels) to 1024 x 1024 pixels, the same resolution as DALL-E 2. Like Nvidia's Deep Learning Super Sampling (DLSS), this upscaling adds new detail that matches the content of the original image, so the result stays sharp at the target resolution. Because the model does not have to output high-resolution images directly, this upscaling step requires far less computing power.
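The sketch below illustrates the idea of a two-stage upscaling cascade from 64 x 64 to 1024 x 1024. Plain bilinear interpolation is used purely as a stand-in for Imagen's text-conditioned super-resolution diffusion models, which synthesize new detail rather than simply resizing.

```python
import torch
import torch.nn.functional as F

def upscale(image: torch.Tensor, size: int) -> torch.Tensor:
    # Stand-in for a super-resolution diffusion stage: here just interpolation.
    return F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)

base = torch.rand(1, 3, 64, 64)   # stand-in for the base model's 64x64 sample
mid = upscale(base, 256)          # stage 1: 64 -> 256
final = upscale(mid, 1024)        # stage 2: 256 -> 1024
print(final.shape)                # torch.Size([1, 3, 1024, 1024])
```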

Evaluation

In the study, the researchers created DrawBench, a new set of text-prompt categories for measuring the quality of text-to-image models. The benchmark is designed to be comprehensive.

According to Google, human testers favoured Imagen's photos.

Image source: Google AI

DrawBench also includes more complicated and creative prompts than the models are accustomed to, pushing them to generate more imaginative and unusual images.

Based on human ratings, Imagen outperformed DALL-E 2, GLIDE, VQ-GAN+CLIP, and Latent Diffusion. It also scored much higher than the other models on image quality and on how well the images aligned with the text.
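For illustration only, the following sketch shows how pairwise human votes on DrawBench-style prompts could be turned into a preference rate. The vote data is invented and does not reflect Google's reported numbers.

```python
# Hypothetical pairwise votes: which model's image a rater preferred for a prompt.
votes = ["imagen", "imagen", "dalle2", "imagen", "tie", "imagen"]

def preference_rate(votes: list[str], model: str) -> float:
    """Fraction of decisive (non-tie) comparisons won by the given model."""
    decisive = [v for v in votes if v != "tie"]
    return sum(v == model for v in decisive) / len(decisive)

print(f"Imagen preferred in {preference_rate(votes, 'imagen'):.0%} of decisive comparisons")
```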

How did Imagen outperform DALL-E 2?

Imagen outperformed DALL-E 2 in this test, which the researchers attributed to the text model's superior linguistic understanding. For example, Imagen converted the prompt "A panda making latte art" into the intended scene in most cases: a panda correctly pouring milk into a latte. DALL-E 2, by contrast, tends to render a panda face drawn in the milk foam.

On the left are the images generated by Imagen; three of the four match the meaning of the prompt. On the right are DALL-E 2's images, all four of which misinterpret it.

Image source: Google AI

Conclusion

Imagen distinguished itself from prominent text-to-image models such as DALL-E 2 by introducing novel techniques. It relies on Google AI's largest frozen text encoder, T5-XXL, which has 4.6 billion parameters. The research demonstrates that increasing the size of the text encoder improves both text-image alignment and image fidelity, and that scaling the text encoder is far more advantageous than scaling the diffusion model. While enlarging the U-Net diffusion model improves sample quality, a larger text encoder has a more significant influence overall.

