In addition to traditional tasks such as prediction, classification and translation, deep learning is receiving growing attention as an approach for music generation, as evidenced by recent research groups such as Magenta at Google and CTRL (Creator Technology Research Lab) at Spotify. The motivation is to use the capacity of deep learning architectures and training techniques to automatically learn musical styles from arbitrary musical corpora and then generate samples from the estimated distribution. However, a direct application of deep learning to content generation quickly reaches its limits, as the generated content tends to mimic the training set without exhibiting true creativity.

Moreover, deep learning architectures do not offer direct ways of controlling generation (e.g., imposing a tonality or other arbitrary constraints). Furthermore, deep learning architectures on their own are closed automata that generate music autonomously, without human interaction, which is far from the objective of interactively assisting musicians in composing and refining music. Our analysis focuses on control, structure, creativity and interactivity.

Various methods have been employed to generate audio conditioned on external input. Relevant examples come from the text-to-audio task, in which text-conditioned spectrogram generation and spectrogram-conditioned audio synthesis have been studied intensively. Restricting our attention to audio generation from descriptive text, text-conditioned general sound-event generation has been approached with auto-regressive methods such as AudioGen and with diffusion-based methods that operate on discrete audio codes, such as DiffSound.
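To make the contrast concrete, the sketch below shows the general shape of a diffusion-based, text-conditioned sampler with classifier-free guidance. It is a minimal illustration only: the toy text_encoder, denoiser, dimensions and one-step update rule are placeholders, not the architecture of DiffSound, AudioGen or any other published system.

```python
import torch

# Toy stand-ins: in a real system these would be a trained text encoder
# and a trained denoising network over audio latents or discrete codes.
text_encoder = torch.nn.Linear(32, 64)   # maps prompt features -> conditioning
denoiser = torch.nn.Linear(64 + 64, 64)  # predicts noise from (latent, cond)

def denoise(latent, cond):
    # Concatenate the noisy latent with the conditioning and predict noise.
    return denoiser(torch.cat([latent, cond], dim=-1))

@torch.no_grad()
def sample(prompt_features, steps=50, guidance=3.0):
    cond = text_encoder(prompt_features)   # conditioning from the text prompt
    uncond = torch.zeros_like(cond)        # "empty prompt" for guidance
    latent = torch.randn(1, 64)            # start from pure noise
    for _ in range(steps):
        eps_c = denoise(latent, cond)      # conditional noise estimate
        eps_u = denoise(latent, uncond)    # unconditional noise estimate
        eps = eps_u + guidance * (eps_c - eps_u)  # classifier-free guidance
        latent = latent - (1.0 / steps) * eps     # crude one-step update
    return latent  # a real system decodes this latent to audio separately

audio_latent = sample(torch.randn(1, 32))
print(audio_latent.shape)  # torch.Size([1, 64])
```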

Stability AI's new AI: Stable Audio

London-based generative AI company Stability AI has launched its first text-to-audio platform, Stable Audio, which lets users generate personalized audio tracks. The AI-powered platform represents the company's first foray into music and sound generation. It can produce songs of up to 90 seconds in length, making it suitable for projects such as commercials, audiobooks and video games.

The company has been one of the prominent players in the AI world, though until now it was known mostly for AI-generated visuals. With the introduction of its first text-to-audio platform, it enters into direct competition with other industry leaders, including OpenAI, Google and Meta.

Training the model  

The platform uses a diffusion model, the same class of AI model that powers the company's better-known image platform, Stable Diffusion. For Stable Audio, however, the model has been trained on audio data instead of images. This allows users to generate songs or background audio of a specified length, making it a versatile tool for a variety of projects.
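As a rough illustration of what "training a diffusion model on audio" means, the toy step below corrupts a clean audio latent with noise and trains a network to predict that noise. The network, corruption schedule and dimensions are stand-ins; the article does not describe Stability AI's actual training setup.

```python
import torch
import torch.nn.functional as F

# Toy denoiser standing in for a trained network over audio latents.
denoiser = torch.nn.Linear(64 + 1, 64)

def training_step(audio_latent):
    """One diffusion training step: corrupt a clean audio latent with noise
    and train the network to predict that noise."""
    t = torch.rand(audio_latent.shape[0], 1)        # random noise level in [0, 1)
    noise = torch.randn_like(audio_latent)
    noisy = (1 - t) * audio_latent + t * noise      # simple linear corruption
    pred = denoiser(torch.cat([noisy, t], dim=-1))  # condition on noise level
    return F.mse_loss(pred, noise)                  # denoising objective

loss = training_step(torch.randn(8, 64))            # batch of 8 audio latents
loss.backward()
print(float(loss))
```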

Additionally, the platform addresses the limitations of conventional audio diffusion models through music-specific training and by incorporating text metadata that specifies a song's start time and duration. This timing conditioning, sketched below, lets users generate songs of a chosen length, a valuable feature for music production.
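One plausible way to implement such timing conditioning is to embed the clip's start time and total duration and append them to the text conditioning. The sketch below assumes simple sinusoidal embeddings; the function names and dimensions are hypothetical, not Stable Audio's published code.

```python
import torch

def timing_conditioning(seconds_start, seconds_total, dim=16):
    """Illustrative timing embedding: encode where a training window sits
    inside a longer recording and how long the full piece is, so a specific
    output length can be requested at inference time."""
    def embed(x):
        freqs = torch.arange(dim // 2, dtype=torch.float32)
        angles = x / (10.0 ** (freqs / (dim // 2)))
        return torch.cat([torch.sin(angles), torch.cos(angles)])
    return torch.cat([embed(torch.tensor(float(seconds_start))),
                      embed(torch.tensor(float(seconds_total)))])

# At inference, request a 90-second piece starting at 0 seconds:
timing = timing_conditioning(seconds_start=0, seconds_total=90)
text_cond = torch.randn(64)            # placeholder text embedding
cond = torch.cat([text_cond, timing])  # joint conditioning vector
print(cond.shape)                      # torch.Size([96])
```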

What makes it different? 

Previously available audio diffusion models could generate only clips of fixed duration, which limited their ability to produce complete songs. Stability AI has refined the model to give Stable Audio users greater flexibility in setting the length of the generated track, granting them more control over the creative process.

Stable Audio offers three pricing tiers. The free version allows users to generate up to 45 seconds of audio for a maximum of 20 tracks per month. The Professional tier, priced at $11.99 per month, lets users create up to 500 tracks. Finally, an enterprise subscription is available for companies seeking customized plans.
