Meta’s open-source MUSICGEN: a single language model to create high-quality music from text or melody

Meta proposes MUSICGEN, a simple and controllable tool that generates high-quality music at 32 kHz based on text or melody prompts. The tool is open-source and publicly available for the music community to explore, reproduce and improve. It can create different styles and genres of music that suit the user’s tastes and preferences.

MUSICGEN can generate music faster and more flexibly than similar tools such as MusicLM, Riffusion, Mousai, and Noise2Music. The model can also adapt the music it creates to the words or melodies given as input. You can listen to generated samples here. 🎧

The approach underwent comprehensive testing, combining evaluation with automated metrics and feedback from human evaluators.

The results indicated that it outperforms alternative methods on a widely recognized text-to-music benchmark, with a subjective rating of 84.8 out of 100 for MUSICGEN against 80.5 for the best baseline.

Furthermore, it was observed that each of its individual components plays an important role in the quality of the generated music.

Creating music from text is a hard task: the model has to produce long and complex musical patterns from a short description such as “90s rock song with a guitar riff”. Music also needs a higher sound quality than speech, which means more data to process.

While speech is typically sampled at 16 kHz, music is usually sampled at 44.1 or 48 kHz. In addition, the model has to combine sounds from different instruments and follow different genres and styles of music.

To better deal with the audio data, recent methods convert the audio signals into tokens.

These tokens are like musical notes, but they also encode other features, such as rhythm and pitch. They compress the audio sample and make the music simpler and easier for the model to process.

The figure below shows different ways of combining the streams of acoustic tokens. During the model evaluation, the “delay” and “flattening” patterns achieved similar scores, while the “parallel” and “VALL-E” patterns obtained worse scores.

The general framework for modeling multiple parallel streams of acoustic tokens using different patterns: flattening pattern, VALL-E pattern, parallel pattern, and delay pattern.
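To make the interleaving concrete, here is a minimal sketch (not the authors’ implementation) of how a “delay” pattern can arrange K parallel codebook streams so that codebook k is shifted by k steps; the function name and padding token are purely illustrative.

```python
# Toy illustration of the "delay" interleaving pattern (not the authors' code).
# Codebook k is shifted right by k steps, so the token streams can be
# predicted together, one time step at a time, instead of being flattened
# into one long sequence.

PAD = "_"  # placeholder for positions with no token yet

def delay_pattern(codes):
    """codes: list of K token streams, each of length T (one per codebook)."""
    K, T = len(codes), len(codes[0])
    steps = T + K - 1  # sequence length after applying the delays
    out = [[PAD] * steps for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]  # stream k is delayed by k steps
    return out

# Example with 4 codebooks (as in MUSICGEN) and 5 time steps.
codes = [[f"c{k}t{t}" for t in range(5)] for k in range(4)]
for row in delay_pattern(codes):
    print(row)
```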

The model

MUSICGEN is a single-stage auto-regressive Transformer model trained over a 32 kHz EnCodec tokenizer with 4 parallel codebooks sampled at 50 Hz. It supports conditional generation based on text, melody, or both. The model extracts information from the input melody and encodes it into tokens that the language model can use to generate new music.
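For a sense of scale, here is a quick back-of-the-envelope calculation with the numbers above (32 kHz audio, 4 codebooks at 50 Hz each); the 30-second duration is just an example.

```python
# Rough token-count arithmetic for MUSICGEN's tokenizer settings.
sample_rate = 32_000   # raw audio samples per second
codebooks = 4          # parallel token streams
frame_rate = 50        # token frames per second, per codebook

tokens_per_second = codebooks * frame_rate        # 200 tokens/s in total
samples_per_frame = sample_rate // frame_rate     # 640 raw samples per token frame

# Autoregressive steps needed for 30 seconds of music (example duration):
flattening_steps = 30 * frame_rate * codebooks    # 6000 steps, one token per step
delay_steps = 30 * frame_rate + (codebooks - 1)   # 1503 steps, 4 tokens per step

print(tokens_per_second, samples_per_frame, flattening_steps, delay_steps)
```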

  • For text encoding, they used T5, FLAN-T5, and CLAP.
  • For audio tokenization, they used the EnCodec tokenizer (see the figure below).
EnCodec architecture

EnCodec quantizes the audio sample using 4 codebooks. Each codebook contains a set of codewords that represent different aspects of the audio, such as pitch, timbre, volume, or rhythm.

For example, if you have a melody with many different notes, you can reduce them to a few by assigning each note to the closest codeword in a predefined codebook.

The first codebook is the most important one, as it captures the essential features of the audio sample; each subsequent codebook refines the residual detail left by the previous ones.
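To illustrate the “closest codeword” idea and why the later codebooks act as refinements, here is a toy residual vector quantization sketch in NumPy; the codebooks are random and purely illustrative, not EnCodec’s trained ones.

```python
import numpy as np

# Toy residual vector quantization (RVQ), the idea behind EnCodec's codebooks.
# Real codebooks are learned during training; these are random and illustrative.
rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 1024, 8
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(frame):
    """Quantize one latent frame: each codebook encodes the residual left by the previous ones."""
    residual, tokens = frame, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # closest codeword
        tokens.append(idx)
        residual = residual - cb[idx]  # later codebooks refine what is left over
    return tokens

def rvq_decode(tokens):
    """Reconstruct the frame by summing the chosen codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

frame = rng.normal(size=dim)    # one latent audio frame (illustrative)
tokens = rvq_encode(frame)      # 4 token ids, one per codebook
error = np.linalg.norm(frame - rvq_decode(tokens))
print(tokens, error)
```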

Training

The Transformer language model was trained on 20,000 hours of licensed music with textual descriptions and additional information. During training it learned how to combine the tokens into streams and create new melodies corresponding to the conditions given by the user.

The input can be a melody or a text prompt such as: “A funky disco song with catchy bass and horns” or “A cheerful pop song with piano and guitar”.

To better control the harmonic and melodic structure of the generated samples, the team introduced unsupervised melody conditioning, which lets the model reproduce the same melody in different genres or styles.
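As a usage illustration, here is a minimal sketch of text- and melody-conditioned generation with the released audiocraft package; the checkpoint name, API calls, and the reference audio path follow the public repository’s examples at the time of writing and should be treated as assumptions rather than a guaranteed match for the current API.

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-capable checkpoint (name as published in the audiocraft repo).
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # generate 8-second samples

# Text-only conditioning.
descriptions = ['A funky disco song with catchy bass and horns',
                'A cheerful pop song with piano and guitar']
wav = model.generate(descriptions)

# Text + melody conditioning: follow a reference melody in a given style.
melody, sr = torchaudio.load('reference_melody.wav')  # placeholder input file
wav_melody = model.generate_with_chroma(
    ['90s rock song with electric guitar and heavy drums'], melody[None], sr)

# Save the results with loudness normalization.
for i, one_wav in enumerate(wav):
    audio_write(f'text_only_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```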

Evaluation results

The model was assessed with both automatic and human evaluation.

The automatic evaluation measured the accuracy of the musical concepts, the alignment of the audio and text, and the quality of the audio samples. 

The human evaluation measured the subjective preferences of the listeners on various aspects of the music.

The results showed that MUSICGEN outperformed other comparable models such as MusicLM, Riffusion, Mousai, and Noise2Music on both objective and subjective measures.

It was able to generate high-quality music that was better aligned with the text and melody prompts, while also being more controllable than other methods.

Example

The figure below is an example of music generation by the model. We can see three different chromagrams for three different pieces of music: the reference melody (left) and the generated melodies (middle and right).

A chromagram is a graphic representation of the pitch content of music: it shows the intensity of each of the 12 pitch classes in a piece of music over time.

Chromagrams from reference melody (left), and generated music conditioned on melody and text (middle) and with text-only conditioning (right)

The text prompt for all three chromagrams is “90s rock song with electric guitar and heavy drums”. This means that the music should sound like rock music from the 1990s, with instruments like electric guitar and drums.

  • The left chromagram is based on a reference melody.
  • The middle one is generated by MUSICGEN, based on text & melody prompts.
  • The right one is also generated by MUSICGEN, but only based on text prompts.

We can observe that the middle chromagram follows the melody prompt very closely, while also adding some variations and styles guided by the text prompt. The right chromagram does not follow any melody prompt, so it creates its own melody based on the text prompt.
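To see how such a chromagram can be computed in practice, here is a small sketch using the librosa library (not part of the MUSICGEN release); the audio file name is a placeholder.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Compute and plot a chromagram: energy of the 12 pitch classes over time.
y, sr = librosa.load('generated_sample.wav', sr=32000)  # placeholder file, resampled to 32 kHz
chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # shape: (12 pitch classes, time frames)

fig, ax = plt.subplots(figsize=(8, 3))
img = librosa.display.specshow(chroma, x_axis='time', y_axis='chroma', sr=sr, ax=ax)
fig.colorbar(img, ax=ax)
ax.set_title('Chromagram')
plt.show()
```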

Datasets 

The training and evaluation datasets of MUSICGEN are: 

  • Training datasets: The model was trained on an internal dataset, the Shutterstock music collection, and the Pond5 music dataset.
  • Evaluation datasets: The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set, with no artist overlap with the training set.

Conclusion, future research

MUSICGEN is an innovative model for music generation based on text or melody inputs.

It offers advantages over previous tools, as it can generate high-quality and diverse music samples using a single-stage Transformer model and efficient token interleaving patterns.

Further research is needed, the authors note, because their dataset may be biased towards Western-style music.

It is also unclear whether MUSICGEN can generate music that is original and creative without plagiarizing or imitating existing works. In this context, MUSICGEN is a promising model for music generation, but it still needs further improvement and evaluation to reach its full potential.

 Learn more: 

Research paper: “Simple and Controllable Music Generation” (on arXiv)

GitHub repository (music samples, code, and models)
