Researchers from the University of Illinois Urbana-Champaign and the Massachusetts Institute of Technology have developed a new technique that combines several models to produce more complex images that better reflect the description they were asked to depict.

The internet experienced a collective high with the release of DALL-E, an artificial intelligence-based image generator whose name nods to Salvador Dali and the endearing robot WALL-E, and which uses natural language to create whatever enigmatic and lovely image your heart desires. The world clearly found it fascinating to watch typed-out inputs such as "smiling gopher holding an ice cream cone" come to life immediately.

Scientists from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) approached the typical model from a different angle to produce more complex images that better match the prompt. They combined several models that work together to create the desired image, capturing the various aspects requested by the input text or labels. For an image with two components described by two sentences, for example, each model can focus on one component.

Given natural language descriptions, large text-guided diffusion models such as DALL-E 2 can produce stunningly photorealistic images. However, although these models are adaptable, they struggle to understand how certain concepts are composed.

Image source: MIT CSAIL

The above photo illustration was created in Photoshop using images generated by an MIT system called Composable Diffusion. Pink dots and geometric, angular images were created using phrases like "diffusion model" and "network." At the top of the image is the phrase "a horse AND a yellow flower field." On the left are generated images of a horse and a yellow field, and on the right are combined images of a horse in a yellow flower field.

In their paper, the researchers propose an alternative, structured approach to compositional generation with diffusion models. An image is generated by composing a set of diffusion models, each of which models a particular component of the image. To achieve this, the researchers interpret diffusion models as energy-based models, in which the data distributions defined by the energy functions can be explicitly combined. The proposed approach can generate scenes at test time that are significantly more complex than those seen during training, composing sentence descriptions, object relations, and human facial attributes, and even generalizing to new combinations.
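
The energy-based view can be made concrete with a short derivation. What follows is a sketch of the standard product-of-distributions argument, not a formula quoted from the paper: the symbols x (the image) and c_1, ..., c_n (the concepts) are notation assumed here, and the first step assumes the concepts are conditionally independent given the image. Multiplying densities corresponds to adding energies, which is why the composed score becomes a sum of per-concept terms.

```latex
% Assumed notation: x is the image, c_1, ..., c_n are the concepts.
% Assumption: the concepts are conditionally independent given x.
\begin{align}
p(x \mid c_1, \dots, c_n)
  &\propto p(x) \prod_{i=1}^{n} p(c_i \mid x)
   \propto p(x) \prod_{i=1}^{n} \frac{p(x \mid c_i)}{p(x)}, \\
\nabla_x \log p(x \mid c_1, \dots, c_n)
  &= \nabla_x \log p(x)
   + \sum_{i=1}^{n} \bigl( \nabla_x \log p(x \mid c_i) - \nabla_x \log p(x) \bigr).
\end{align}
```

Each term in the sum measures how much one concept shifts the unconditional score, so adding a concept amounts to adding one more correction term.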

The researchers went on to show how their method can be applied to pre-trained text-guided diffusion models to produce photorealistic images containing all the details in the input descriptions, including the binding of certain object attributes that DALL-E 2 has had trouble with in the past. These results demonstrate that the proposed approach promotes structured generalization in visual generation. Because diffusion models are interpreted as energy-based models, they can be composed explicitly to generate images with significantly more complex combinations than any seen during training.

Furthermore, the researchers proposed two compositional operators, concept conjunction and concept negation, which make it possible to compose diffusion models during inference without additional training. The proposed composable diffusion models can generate images based on sentence descriptions, object relations, and human facial attributes. They can even generalize to combinations that are rarely seen in the real world. These findings show that the proposed method for compositional visual generation is effective.
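
One way to picture the conjunction operator at inference time is as a weighted sum of noise predictions, similar in spirit to classifier-free guidance. The sketch below is a minimal illustration under that assumption, not the authors' implementation: `compose_noise`, the toy arrays, and the weights are hypothetical names and values, and treating negation as a negative weight is a simplification chosen for illustration.

```python
import numpy as np


def compose_noise(eps_uncond, eps_conds, weights):
    """Combine per-concept noise predictions into one prediction.

    eps_uncond: unconditional noise prediction (array-like).
    eps_conds:  list of noise predictions, one per concept.
    weights:    list of floats, one per concept. A positive weight pulls
                the sample toward a concept; a negative weight is used
                here as an illustrative stand-in for negation.
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per concept")

    eps_uncond = np.asarray(eps_uncond, dtype=float)
    composed = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        # Each concept contributes its shift relative to the unconditional
        # prediction, scaled by its weight.
        composed += w * (np.asarray(eps_c, dtype=float) - eps_uncond)
    return composed


# Toy stand-ins for what a denoising network would output at one step.
eps_uncond = np.zeros(3)
eps_horse = np.array([1.0, 0.0, 0.0])   # hypothetical "a horse"
eps_field = np.array([0.0, 2.0, 0.0])   # hypothetical "a yellow flower field"

# Conjunction: "a horse AND a yellow flower field".
eps_and = compose_noise(eps_uncond, [eps_horse, eps_field], [1.0, 1.0])

# Illustrative negation: "a horse" but NOT "a yellow flower field".
eps_not = compose_noise(eps_uncond, [eps_horse, eps_field], [1.0, -1.0])
```

In a real sampler, the composed prediction would replace the single-prompt noise estimate at every denoising step, which is why no retraining is needed: only the way existing predictions are combined changes.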

Conclusion

Now that Composable Diffusion can operate on top of generative models such as DALL-E 2, the researchers want to investigate continual learning as a potential next step. One limitation of their current method is that, although multiple diffusion models can be combined, they are all instances of the same model.

Furthermore, when the researchers tried to combine diffusion models trained on different datasets, they had limited success. Compositional generation with Energy-Based Models (EBMs), on the other hand, can successfully compose multiple separately trained models. Incorporating additional structures from EBMs, such as a conservative score field, into diffusion models may be a promising direction toward enabling compositions of separately trained diffusion models.
