The recent surge in the popularity of diffusion models for image generation has renewed interest in applying similar models to other areas of media synthesis. However, the use of diffusion models for music generation has yet to be extensively investigated.

Music generation, and audio generation more broadly, is a difficult problem because it involves many components at different levels of abstraction. Despite the challenge, research on automated or model-assisted music production has been active. Given the recent growth of deep learning and its success in computer vision and natural language processing, it is encouraging to see how much deep learning models can contribute to audio generation. Existing audio generation models employ recurrent neural networks, generative adversarial networks, autoencoders, and transformers.

Generating music requires managing several factors, including

  • the temporal dimension, 
  • the long-term structure, 
  • multiple layers of overlapping sounds, and 
  • subtleties that only skilled listeners can perceive.


Although diffusion models, a more recent advance in generative modeling, have been used for speech synthesis, they have not yet been thoroughly explored for music generation.

The field of music synthesis also faces several ongoing challenges, including the need to:

  • Model long-term structure.
  • Improve audio quality.
  • Increase the diversity of the generated music.
  • Provide easier control over synthesis, for example through text prompts.

Enabling individuals to compose music through an approachable text-based interface can encourage the general public to engage in the creative process. It can also inspire creators and provide an inexhaustible supply of creative audio samples.

The music industry would benefit substantially from a single model that accommodates all of these features. In this work, the researchers study the potential of diffusion models for text-conditional music generation. They devise a cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions. For each model, they aim for an inference speed suitable for real-time generation on a single consumer GPU. Alongside the trained models, they provide a collection of open-source libraries to support future work on the subject.
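
To make the shape of such a pipeline concrete, here is a minimal sketch of how a two-stage text-to-audio cascade could be wired together at inference time. The function names, latent width, and stand-in bodies are our own illustration under the constraints described above (48 kHz stereo, roughly 64x compression), not the authors' code.

import torch

SAMPLE_RATE = 48_000   # 48 kHz output, as described in the paper
COMPRESSION = 64       # the latent is roughly 64x shorter than the audio representation
CHANNELS = 2           # stereo

def sample_latent_from_text(prompt: str, latent_length: int) -> torch.Tensor:
    # Stand-in for the text-conditional latent diffusion stage: the real model
    # iteratively denoises a latent conditioned on text embeddings. Here we
    # only return noise of a plausible shape (a latent width of 32 is assumed).
    return torch.randn(1, 32, latent_length)

def decode_latent_to_audio(latent: torch.Tensor, num_samples: int) -> torch.Tensor:
    # Stand-in for the diffusion decoder stage: a 1D U-Net that maps the
    # compressed latent back to a stereo waveform. Here we return silence.
    return torch.zeros(1, CHANNELS, num_samples)

def generate(prompt: str, seconds: float = 60.0) -> torch.Tensor:
    num_samples = int(seconds * SAMPLE_RATE)
    latent = sample_latent_from_text(prompt, num_samples // COMPRESSION)
    return decode_latent_to_audio(latent, num_samples)

audio = generate("calm piano with soft rain", seconds=90)  # shape (1, 2, 4320000)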

Conclusion

In their work, the researchers presented Moûsai, a waveform-based audio generation approach built from two diffusion models. First, they trained a diffusion autoencoder that compresses a magnitude-only spectrogram by a factor of 64. The compressed latent is then decoded back to a waveform using a bespoke 1D U-Net and diffusion.
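
For a rough sense of scale, the sketch below (our own illustration with placeholder values, not the authors' code) builds a magnitude-only spectrogram from a stereo 48 kHz signal with torch.stft and compares its size to a hypothetical latent that is 64x smaller. In the paper this compression is learned by the diffusion autoencoder, not computed by a fixed transform.

import torch

sample_rate = 48_000
waveform = torch.randn(2, sample_rate * 4)  # 4 seconds of placeholder stereo audio

# Magnitude-only spectrogram: keep |STFT| and discard the phase.
window = torch.hann_window(1024)
spectrogram = torch.stft(waveform, n_fft=1024, hop_length=256,
                         window=window, return_complex=True).abs()
print(spectrogram.shape)  # (2, 513, num_frames)

# The diffusion autoencoder learns a latent about 64x smaller than this
# representation; the fixed transform above is shown only for scale.
print(spectrogram.numel(), spectrogram.numel() // 64)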

In the second stage, the researchers train a diffusion model to generate a new latent from noise, conditioned on text embeddings obtained from a frozen T5 transformer, using the same 1D U-Net architecture as in the first stage. In contrast to previous efforts, they demonstrate that their model can generate minutes of high-quality music in real time on a consumer GPU with a convincing text-audio correspondence. In addition to the trained models, the researchers release a collection of open-source libraries to facilitate future work on the subject. They anticipate that this work will pave the way for future text-to-music applications with higher quality and longer context.
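
As a simplified example of the text-conditioning step, the snippet below extracts embeddings from a frozen T5 encoder using the Hugging Face transformers library. The exact checkpoint and the way the embeddings are consumed by the U-Net (for example via cross-attention) are assumptions for illustration.

import torch
from transformers import AutoTokenizer, T5EncoderModel

# Frozen T5 encoder providing the text embeddings that condition the latent
# diffusion model; "t5-base" is an assumed checkpoint used for illustration.
checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = T5EncoderModel.from_pretrained(checkpoint).eval()
for param in encoder.parameters():
    param.requires_grad_(False)  # keep the text encoder frozen

prompt = "uplifting electronic track with driving drums"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embedding = encoder(**tokens).last_hidden_state  # (1, seq_len, 768)

# During sampling, the 1D U-Net would attend to `text_embedding` at each
# denoising step while turning noise into a new latent.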

Explore the Paper, GitHub, and Demo.

