Text-to-image generation is a technique at the intersection of computer vision and natural language processing that produces an image from a written description. It requires representing the input words in a meaningful way, such as a feature vector, and then using that representation to create an image that matches the description.
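
To make the two-stage idea concrete, here is a minimal, illustrative sketch of the pipeline: a text encoder turns the description into a feature vector, and a conditional generator maps that vector (plus noise) to pixels. The module names, sizes, and architecture below are simplified assumptions for illustration, not any specific published model.

```python
# Minimal sketch: encode the description into a feature vector, then
# condition an image generator on that vector. All sizes are illustrative.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        emb = self.embed(token_ids)
        _, hidden = self.rnn(emb)
        return hidden[-1]                # (batch, embed_dim) sentence vector

class ConditionalGenerator(nn.Module):
    def __init__(self, embed_dim=256, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, 4 * 4 * 128), nn.ReLU(),
            nn.Unflatten(1, (128, 4, 4)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, text_vec, noise):
        # Concatenate the sentence vector with random noise, decode to pixels.
        return self.net(torch.cat([text_vec, noise], dim=1))

text_vec = TextEncoder()(torch.randint(0, 10000, (1, 12)))
image = ConditionalGenerator()(text_vec, torch.randn(1, 100))  # (1, 3, 16, 16)
```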

Text-to-image models rely on computer vision algorithms to understand, label, and interpret images. The underlying technology is advancing quickly, building on the same progress in visual recognition that has already produced breakthroughs such as facial recognition and self-driving cars.

The datasets used to train and test these models play a significant role in the comprehensiveness, accuracy, and variety of the generated images.

The following datasets are the ones most commonly used by image synthesis models:

COCO (Microsoft Common Objects in Context)

The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale dataset for object detection, segmentation, keypoint detection, and captioning. It contains 328K images.

The first version of the MS COCO dataset was released in 2014, with 164K images split into 83K training, 41K validation, and 41K test images. In 2015, an additional test set of 81K images was made available, consisting of all the previous test images plus 40K new ones.
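
Since MS COCO is mainly used as a source of image-caption pairs in text-to-image work, it may help to see how the caption annotations are read. The sketch below uses the pycocotools API; the annotation file path assumes the standard 2014 download layout and should be adapted to the local copy.

```python
# Reading COCO caption annotations with pycocotools (pip install pycocotools).
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2014.json")  # assumed local path

img_ids = coco_caps.getImgIds()
first_img = img_ids[0]

# Each image typically has several human-written captions.
ann_ids = coco_caps.getAnnIds(imgIds=[first_img])
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])
```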

CUB-200-2011 (Caltech-UCSD Birds-200-2011)

The most popular dataset for fine-grained visual categorization tasks is the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset. It contains 11,788 images of birds from 200 subcategories: 5,994 for training and 5,794 for testing.

Each image is annotated with a subcategory label, the locations of 15 parts, 312 binary attributes, and a bounding box. The dataset was later extended with more detailed natural-language descriptions: ten single-sentence descriptions per image, collected through the Amazon Mechanical Turk (AMT) platform. Each description must contain at least ten words and must not mention subcategories or actions.
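
The annotations above are distributed as plain text index files. The sketch below pairs each image with its train/test split flag and bounding box, assuming the standard archive layout (images.txt, train_test_split.txt, and bounding_boxes.txt inside a CUB_200_2011/ directory); the natural-language captions are distributed separately.

```python
# Parsing CUB-200-2011 metadata files, assuming the standard archive layout.
from pathlib import Path

root = Path("CUB_200_2011")  # assumed extraction directory

def read_lines(name):
    with open(root / name) as f:
        return [line.split() for line in f]

# image_id -> relative image path
images = {int(i): p for i, p in read_lines("images.txt")}
# image_id -> True if the image belongs to the training split
is_train = {int(i): flag == "1" for i, flag in read_lines("train_test_split.txt")}
# image_id -> (x, y, width, height)
boxes = {int(i): tuple(map(float, rest)) for i, *rest in read_lines("bounding_boxes.txt")}

train_ids = [i for i, flag in is_train.items() if flag]
print(len(train_ids), "training images")          # expected: 5994
print(images[train_ids[0]], boxes[train_ids[0]])  # path and bounding box
```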

Multi-Modal CelebA-HQ

Multi-Modal-CelebA-HQ is a large-scale face image dataset with 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image is paired with a high-quality segmentation mask, a sketch, a descriptive text, and a version of the image with a transparent background.

Multi-Modal-CelebA-HQ can be used to train and test algorithms for text-to-image generation, text-guided image manipulation, sketch-to-image generation, and GANs for face generation and editing.
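
For text-to-image training, the dataset is typically consumed as image-caption pairs. Below is a minimal PyTorch Dataset sketch that pairs each face image with its description; the folder and file names are assumptions about how a local copy might be organized, so adjust them to the actual download.

```python
# Minimal image/caption Dataset sketch for Multi-Modal-CelebA-HQ.
# Folder names ("images", "captions") are assumed, not the official layout.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class MMCelebAHQText(Dataset):
    def __init__(self, root, transform=None):
        self.root = Path(root)
        self.image_paths = sorted((self.root / "images").glob("*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        # One caption file per image, same stem, in a parallel folder (assumed).
        caption_file = self.root / "captions" / (img_path.stem + ".txt")
        caption = caption_file.read_text().strip().splitlines()[0]
        return image, caption
```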

Oxford 102 Flower (102 Category Flower Dataset)

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are species that commonly occur in the United Kingdom. Each class contains between 40 and 258 images.

The images have large variations in scale, pose, and lighting. There are also categories with large variations within them, as well as several very similar categories.
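
Recent torchvision releases (0.12 and later) include a loader for this dataset that downloads the images and labels automatically; a brief sketch:

```python
# Loading Oxford 102 Flower with torchvision's built-in dataset class.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

flowers = datasets.Flowers102(root="data", split="train",
                              transform=transform, download=True)
image, label = flowers[0]     # label is an integer class index in [0, 101]
print(image.shape, label)
```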

LHQ (Landscapes High-Quality)

LHQ is a set of 90,000 high-resolution images of natural landscapes, crawled from Unsplash and Flickr and processed with Mask R-CNN and Inception V3.

LAION COCO

LAION-COCO is the most extensive publicly available set of high-quality captions for web images, containing 600 million generated captions. The images are taken from the English subset of LAION-5B, and the captions are produced with an ensemble of BLIP L/14 and two CLIP versions (L/14 and RN50x64). The dataset enables training models that generate high-quality captions for images.
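
LAION-COCO is distributed as metadata (image URLs plus generated captions) rather than image files. The sketch below streams a few records with the Hugging Face datasets library; the dataset id, split name, and field names are assumptions about the hosted copy, so check the dataset card before relying on them.

```python
# Streaming a few LAION-COCO records without downloading the full metadata.
from datasets import load_dataset

laion_coco = load_dataset("laion/laion-coco", split="train", streaming=True)

for record in laion_coco.take(3):
    print(record.keys())   # inspect the available fields (URL, captions, ...)
    print(record)
```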

ImageNet

ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images; it is currently limited to nouns. Google's Language-Image Mixture of Experts (LIMoE), a sparse model with 5.6 billion parameters, was evaluated on ImageNet in a zero-shot setting.
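
ImageNet is not freely redistributable, so torchvision's loader expects the ILSVRC2012 archives to have already been downloaded manually into the root directory; a brief sketch of loading the validation split:

```python
# Loading the ImageNet validation split with torchvision (archives must be
# downloaded beforehand into the root directory).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

imagenet_val = datasets.ImageNet(root="data/imagenet", split="val",
                                 transform=transform)
image, label = imagenet_val[0]
print(imagenet_val.classes[label])   # WordNet names for the predicted class
```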
