In the present era of artificial intelligence, computers can create their own "art" using diffusion models. These models gradually add structure to a noisy starting point until a clear image or video emerges. Diffusion models have recently gained widespread attention.

By simply entering a few phrases, users can instantly immerse themselves in dreamlike scenarios that blend reality and imagination, delivering a quick hit of dopamine. Behind the scenes, however, the process is intricate and time-consuming: the algorithm needs many iterative passes to reach optimal image quality.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a framework that collapses the complex multi-step process of classic diffusion models into a single step, thereby overcoming previous speed constraints. This is accomplished with a teacher-student approach: a new model (the student) is trained to imitate the behaviour of the more complex original model (the teacher) that produces the visuals. The technique, referred to as distribution matching distillation (DMD), preserves the quality of the produced images while enabling significantly faster generation.
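To make the speed difference concrete, the sketch below contrasts classic iterative sampling with a distilled one-step generator. It is purely illustrative: ToyDenoiser, ToyGenerator and the DDIM-style loop are placeholder stand-ins, not the actual models or schedulers used in DMD.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained multi-step diffusion model (the teacher)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x, t):
        return self.net(x)

class ToyGenerator(nn.Module):
    """Stand-in for the distilled one-step generator (the student)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, z):
        return self.net(z)

def sample_multi_step(teacher, noise, num_steps=50):
    # Classic diffusion sampling: dozens of sequential network evaluations,
    # each one a full forward pass that refines the noisy image a little more.
    x = noise
    for t in reversed(range(num_steps)):
        x = teacher(x, t)
    return x

def sample_one_step(student, noise):
    # Distilled sampling: a single forward pass, roughly num_steps times cheaper.
    return student(noise)

noise = torch.randn(1, 3, 64, 64)
slow_image = sample_multi_step(ToyDenoiser(), noise)
fast_image = sample_one_step(ToyGenerator(), noise)
```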

This one-step diffusion model can improve design tools, allow for faster content production, and facilitate progress in drug development and 3D modelling, where speed and effectiveness are crucial.

DMD consists of two cleverly designed components. First, a regression loss anchors the mapping from noise to images to a fixed reference point; this roughly organizes the image space and stabilizes training. Second, a distribution matching loss ensures that the probability of the student generating a particular image matches how often that image occurs in the real-world distribution. To compute it, the system uses two diffusion models as guides, one modelling the distribution of authentic images and one modelling the distribution of the generator's own outputs; the difference between the two tells the system how real or manufactured a sample looks. Together, these components enable efficient training of the rapid one-step generator.
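The sketch below shows how these two losses might be combined in a training step. It is an illustrative rendering under stated assumptions, not the authors' code: ToyScoreModel, ToyGenerator, dmd_loss, the lambda_reg weight and the additive noise perturbation are all placeholders for the real architectures, weighting and noise schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyScoreModel(nn.Module):
    """Stand-in for a diffusion model used as a score estimator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

class ToyGenerator(nn.Module):
    """Stand-in for the one-step student generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, z):
        return self.net(z)

def dmd_loss(generator, real_score, fake_score, z, teacher_image, t, lambda_reg=0.25):
    """Schematic combination of DMD's two losses (weights are illustrative)."""
    x = generator(z)

    # (1) Regression loss: pin a fixed noise -> image mapping to the teacher's
    # multi-step output, roughly organizing image space and stabilizing training.
    reg = F.mse_loss(x, teacher_image)

    # (2) Distribution matching loss: perturb the generated image and score it
    # with the two diffusion models. Their difference approximates the gradient
    # of a KL divergence between the generated and real image distributions,
    # so it is applied to x as a constant direction (no backprop through scores).
    x_t = x + torch.randn_like(x)   # schematic forward-diffusion noise at level t
    with torch.no_grad():
        direction = fake_score(x_t, t) - real_score(x_t, t)
    dm = (x * direction).mean()

    return dm + lambda_reg * reg

# Illustrative usage with random tensors standing in for real training data.
gen, s_real, s_fake = ToyGenerator(), ToyScoreModel(), ToyScoreModel()
z = torch.randn(2, 3, 64, 64)
teacher_image = torch.randn(2, 3, 64, 64)  # would come from the multi-step teacher
loss = dmd_loss(gen, s_real, s_fake, z, teacher_image, t=torch.tensor([10, 10]))
loss.backward()
```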

Compared to standard approaches, DMD consistently performed better across various benchmarks. On the popular ImageNet benchmark for class-conditional image generation, DMD comes within roughly 0.3 Fréchet inception distance (FID) of the original multi-step model. This is noteworthy because FID measures both the quality and the diversity of generated images, and it makes DMD the first one-step diffusion technique to produce images competitive with those of the original, more complex models. On top of that, DMD achieves state-of-the-art one-step generation performance and handles text-to-image generation at an industrial scale. There is still some room for improvement, as a small quality gap remains on more demanding text-to-image applications.

Furthermore, the quality of DMD-generated images is intrinsically tied to the capabilities of the teacher model used during distillation. In its current form, which distills from the Stable Diffusion v1.5 teacher, the student inherits limitations such as difficulty rendering detailed text and small faces. This suggests that more advanced teacher models could significantly improve DMD-generated images.

Conclusion

Diffusion models produce images of excellent quality, but they typically require dozens of forward passes through the network. The researchers propose distribution matching distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. The one-step generator is enforced to match the diffusion model at the distribution level by minimizing an approximate KL divergence. The gradient of this divergence can be expressed as the difference between two score functions: one for the target distribution and one for the synthetic distribution produced by the one-step generator. The score functions are parameterized as two diffusion models, trained separately on each distribution.
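Written out, the gradient described above takes roughly the following form. The notation here (the generator G_theta and the score functions s_real and s_fake) is a conventional rendering of the idea rather than the paper's exact formulation:

```latex
\[
\nabla_\theta \, D_{\mathrm{KL}}\big(p_{\text{fake}} \,\|\, p_{\text{real}}\big)
\;\approx\;
\mathbb{E}_{z,\,t}\Big[
    \big(s_{\text{fake}}(x_t, t) - s_{\text{real}}(x_t, t)\big)\,
    \frac{\partial G_\theta(z)}{\partial \theta}
\Big],
\quad \text{where } x_t \text{ is a noised version of } G_\theta(z).
\]
```

Here s_real and s_fake are the score functions (gradients of the log densities of the noised real and generated distributions) estimated by the two diffusion models; descending this gradient nudges the generator's samples toward regions the real-data model rates as likely and away from regions the fake-data model rates as likely.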

Furthermore, by incorporating a simple regression loss that matches the large-scale structure of the multi-step diffusion outputs, the authors' approach surpasses all previously published few-step diffusion methods, reaching an FID of 2.62 on ImageNet 64x64 and 11.49 on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Using FP16 inference, their model can produce images at 20 frames per second on contemporary hardware.
