
Large models learn to listen to music! Precise analysis of styles and instruments, plus editing and synthesis

Last updated: 2024-01-22
Contributed by Tencent PCG ARC Lab
Qubits | Official account QbitAI

A multi-modal large model that can handle music has finally arrived!

It can accurately analyze the melody, rhythm, and instruments in a piece of music, and even interpret its mood.

And it can do more than listen: give it a piece of text and a picture, and it will grasp the mood of the picture and compose music according to the text requirements:

It can even add a soundtrack to a silent video:

It can also edit existing music, such as removing the drums from a track.

All of the above comes from M2UGen, a multi-modal music understanding and generation framework built on large language models and newly released by Tencent's PCG ARC Lab.

It can perform music understanding, music editing, and multi-modal music generation (text/image/video to music generation).

The research team compared the model's five capabilities with existing models one by one and ran subjective evaluation experiments on the three multi-modal music generation subtasks (text/image/video-to-music generation), finding that M2UGen outperforms existing models.

In addition, since few suitable datasets exist for training such a model, the research team also developed a data generation pipeline and produced and released four datasets: MUCaps, MUEdit, MUImage, and MUVideo.

At present, the team has open-sourced the code on GitHub and released the model weights and training datasets on Hugging Face (application required).

So, how is M2UGen implemented?

The model is divided into four modules

The M2UGen model consists of four modules: a multi-modal feature encoder, a multi-modal understanding adapter, a bridging LLM, and a music understanding and generation module.

The following figure shows the overall framework of the M2UGen model:

Multimodal feature encoder

In order to achieve multi-modal music understanding and generation, the model needs to process multi-modal input.

To this end, the research team adopted existing modality encoders: the music encoder MERT, the image encoder ViT, and the video encoder ViViT.

ViT and ViViT are two Transformer-based encoders widely used in the vision field and frequently adopted in LLM-related work, so the team chose them as the image and video encoders, respectively.

For music input, the team's earlier work MU-LLaMA showed that MERT significantly outperforms other audio/music encoders, so MERT was selected as the music encoder in M2UGen.
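
To make this concrete, here is a minimal feature-extraction sketch for the three encoders named above. The checkpoint names are assumed public Hugging Face models and are not confirmed by the article; MERT's custom code may also require extra dependencies.

```python
# Minimal feature-extraction sketch for the three encoders named above.
# Checkpoint names are assumed public Hugging Face models, not confirmed by the article;
# MERT's custom code may require extra dependencies (e.g. nnAudio).
import numpy as np
from transformers import (AutoModel, Wav2Vec2FeatureExtractor,
                          ViTImageProcessor, ViTModel,
                          VivitImageProcessor, VivitModel)

# Music encoder: MERT (expects 24 kHz mono audio)
mert_proc = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
mert = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
audio = np.random.randn(24000 * 5)                        # 5-second placeholder waveform
a_in = mert_proc(audio, sampling_rate=24000, return_tensors="pt")
music_feat = mert(**a_in, output_hidden_states=True).hidden_states[-1]

# Image encoder: ViT
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
image = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)   # placeholder image
image_feat = vit(**vit_proc(images=image, return_tensors="pt")).last_hidden_state

# Video encoder: ViViT (this checkpoint expects a 32-frame clip)
viv_proc = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
vivit = VivitModel.from_pretrained("google/vivit-b-16x2-kinetics400")
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(32)]
video_feat = vivit(**viv_proc(frames, return_tensors="pt")).last_hidden_state

print(music_feat.shape, image_feat.shape, video_feat.shape)
```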

Multimodal understanding adapter

The main function of this module is to aggregate the feature vectors output by the encoders and feed them into the downstream LLM, where they steer the LLM's output together with the text input.

As shown in the figure below, this module mainly consists of a 1D convolutional layer, a linear projection layer, and a dense network module.

The final dense network module is shown below:

The dense network consists of three sub-modules, each built from a normalization layer, linear layers, and a SiLU activation function.

In the paper, this process is expressed by a formula in which Xi denotes the output embedding of the i-th sub-module, Lj,i denotes the j-th linear layer of the i-th sub-module, Ni denotes the normalization layer inside the i-th sub-module, and SiLU is the activation function.

This dense network design continues from the team’s previous work MU-LLaMA.

After the dense network, a 4096-dimensional embedding vector is output and provided to the downstream LLM.
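
As a concrete illustration, below is a minimal PyTorch sketch of an adapter with the structure described above (1D convolution, linear projection, and SiLU-gated, normalized sub-modules with residual connections). The layer widths, kernel size, and gating arrangement are illustrative assumptions and are not taken from the M2UGen code.

```python
# Minimal sketch of a multimodal understanding adapter consistent with the description above.
# Widths, kernel size, and gating are illustrative assumptions, not the M2UGen configuration.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """One sub-module: normalization, SiLU-gated linear layers, residual connection."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, hidden)   # gated branch, passed through SiLU
        self.up = nn.Linear(dim, hidden)     # value branch
        self.down = nn.Linear(hidden, dim)   # projection back to the model width
        self.act = nn.SiLU()

    def forward(self, x):
        h = self.norm(x)
        return x + self.down(self.act(self.gate(h)) * self.up(h))

class UnderstandingAdapter(nn.Module):
    """1D convolution to shorten the sequence, projection to the LLM width, dense blocks."""
    def __init__(self, in_dim=1024, llm_dim=4096, n_blocks=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, in_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(in_dim, llm_dim)
        self.blocks = nn.ModuleList([DenseBlock(llm_dim, llm_dim) for _ in range(n_blocks)])

    def forward(self, feats):                               # feats: (batch, seq, in_dim)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x = self.proj(x)
        for blk in self.blocks:
            x = blk(x)
        return x                                            # (batch, seq', 4096) for the LLM

adapter = UnderstandingAdapter()
print(adapter(torch.randn(2, 200, 1024)).shape)             # torch.Size([2, 100, 4096])
```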

Bridging LLM

To introduce multi-modal context into the LLM, the researchers connect the outputs of the upstream multi-modal understanding adapters to designated layers of the LLM.

The researchers used the LLaMA 2 model developed by Meta as the base LLM, as shown in the figure below.

The model version selected here is the LLaMA 2 7B model, which contains N=32 hidden layers.

Counting from the top of the model, one modality's information is injected every L layers (L=6), with music, image, and video injected from top to bottom through zero-initialized attention modules; the bottom (N-3L-1) layers use the original attention modules.

The LLM's text instructions are fed in at the bottom, i.e., the first layer. With this technique, the other modalities are able to guide the LLM's output.
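
A small sketch of the layer assignment described above (N = 32 LLaMA 2 layers, L = 6 layers per modality, injected from the top down); the exact indexing in M2UGen may differ slightly from this illustration.

```python
# Sketch of the layer assignment described above: N = 32 LLaMA 2 layers, L = 6 per modality,
# injected from the top down. The exact indexing in M2UGen may differ slightly.
N, L = 32, 6
MODALITIES = ["music", "image", "video"]

def injected_modality(layer_idx_from_bottom):
    """Return which modality feeds this layer's zero-initialized attention, if any."""
    depth_from_top = N - 1 - layer_idx_from_bottom      # 0 = topmost layer
    if depth_from_top < 3 * L:
        return MODALITIES[depth_from_top // L]
    return None                                         # lower layers keep vanilla attention

for i in range(N - 1, -1, -1):
    print(f"layer {i:2d}: {injected_modality(i) or 'vanilla attention'}")
```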

Music understanding and generation module

Inspired by NExT-GPT, the model introduces special audio tokens [AUD] to distinguish between music question-answering and music generation tasks.

During training, for sample pairs whose output is music (i.e., the music generation task, such as text-instruction/music pairs), these audio tokens are appended to the end of the LLM output to signal that music should be generated downstream.

At inference time, if the user's instruction involves music generation, e.g., "Generate a piece of music using flute", the LLM output will contain the audio tokens, and the downstream music decoder will act on the instruction and generate music featuring the flute;

On the other hand, if the LLM output contains no audio tokens, the user is requesting a music understanding task, and the LLM answers the question directly.

The researchers tried two music decoders, AudioLDM 2 and MusicGen; MusicGen's music generation performance was better than AudioLDM 2's.
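
The routing logic described above can be sketched as follows; the function and variable names are illustrative placeholders rather than the actual M2UGen API.

```python
# Sketch of the routing described above; names are illustrative placeholders, not the M2UGen API.
AUD_TAG = "[AUD]"

def respond(llm_output_text, conditioning_embeddings=None, music_decoder=None):
    """Return (text, audio): audio is produced only when the LLM emitted the audio tag."""
    if AUD_TAG in llm_output_text and music_decoder is not None:
        # Music generation branch: the output mapping module's embeddings condition the decoder.
        audio = music_decoder(conditioning_embeddings)   # e.g. a wrapped MusicGen call
        return llm_output_text.replace(AUD_TAG, "").strip(), audio
    # Music understanding branch: plain text answer (captioning / question answering).
    return llm_output_text, None

print(respond("This piece features a gentle piano melody over a slow waltz rhythm."))
```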

New datasets are proposed, and training proceeds in three stages

Training datasets

As stated in the paper's contributions, the study constructed four datasets: MUCaps, MUEdit, MUImage, and MUVideo. Sample data are shown in the figure below.

MUCaps dataset:

  • About 1,200 hours of public music from AudioSet and some websites;

  • The MU-LLaMA model is used to generate music annotations for the collected music files to form music-text pairs.

MUEdit dataset:

  • Build a music pool from AudioSet (disjoint from the MUCaps pool) and filter out about 60 hours of similar music-music pairs;

  • The filtering conditions include tempo and beats, yielding music-music pairs that are broadly similar yet differ in certain respects (for example, in the instruments used); a filtering sketch follows this list;

  • Treat each music-music pair as a source-target pair: the annotation text of the source music is fed to the MPT-7B model to produce the human side of the dialogue, and the annotation text of the target music is fed to MPT-7B to produce the model side, so that both the source and the target music obtain corresponding instructions for model training.
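
The tempo/beat filtering mentioned in the list above could look roughly like the following librosa-based sketch; the article does not give the exact tooling or thresholds, so both are assumptions.

```python
# Illustrative tempo/beat similarity filter in the spirit of the MUEdit construction above.
# librosa and the tolerance values are assumptions, not the authors' actual pipeline.
import numpy as np
import librosa

def is_similar_pair(path_a, path_b, tempo_tol=5.0, beat_tol=0.1):
    """Keep a music-music pair when tempo and beat count are close enough."""
    def tempo_and_beats(path):
        y, sr = librosa.load(path, sr=None, mono=True)
        tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
        return float(np.atleast_1d(tempo)[0]), len(beats)

    tempo_a, beats_a = tempo_and_beats(path_a)
    tempo_b, beats_b = tempo_and_beats(path_b)
    close_tempo = abs(tempo_a - tempo_b) <= tempo_tol
    close_beats = abs(beats_a - beats_b) <= beat_tol * max(beats_a, beats_b, 1)
    return close_tempo and close_beats
```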

MUImage/MUVideo datasets:

  • Sample image/video-music pairs from AudioSet (using music different from MUCaps/MUEdit to minimize duplication), and use the BLIP/VideoMAE models to annotate the images/videos (a captioning sketch follows this list);

  • Feed the image/video annotation text plus the music annotation text into the MPT-7B model to obtain the human-side and model-side dialogues, respectively.
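
For reference, a minimal BLIP captioning sketch in the spirit of the image annotation step above, assuming the public Salesforce/blip-image-captioning-base checkpoint; the authors' exact captioning setup is not specified here.

```python
# Minimal BLIP captioning sketch; the checkpoint and image path are illustrative assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")     # placeholder image path
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```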

The above data set construction script can be found at:
https://github.com/shansongliu/M2UGen/tree/main/Datasets

The training of M2UGen follows the training recipe of NExT-GPT and is divided into three stages: encoder-side training, decoder-side training, and joint encoder-decoder training.

Stage 1: Encoder-side training

This stage freezes the multimodal encoders and the LLM and trains only the multimodal understanding adapters;

Use music/image/video-text pairs from MUCaps/COCO/MUVideo for stage 1 training;

The training loss is cross-entropy, comparing the LLM's output with the target annotation text.

Stage 2: Decoder-side training

At this stage the encoding side (modality encoders and adapters) is not involved; the LLM is frozen and the output mapping module is trained;

This stage trains the LLM to produce signals that instruct the downstream music decoder to output music, or, depending on the input instruction, to directly answer questions about or caption the input music;

What needs to be aligned is the output of the music decoder's text encoder (AudioLDM 2/MusicGen) with the conditional embedding vector produced by the M2UGen output mapping module, i.e., the two output ends are aligned;

During this stage, the special audio tokens [AUD] indicate whether music should be generated: if the LLM output contains [AUD], both text and music are produced (music generation); if not, only text is produced (music question answering);

The loss combines cross-entropy and mean squared error: the cross-entropy compares the audio tokens output by the LLM with the ground-truth audio tokens, and the mean squared error compares the conditional embedding vector produced by the M2UGen output mapping module with the text embedding vector output by the music decoder's text encoder.
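
A minimal sketch of such a combined objective is shown below; tensor shapes and the loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the stage-2 objective described above. Shapes and weighting are illustrative.
import torch
import torch.nn.functional as F

def stage2_loss(token_logits, target_token_ids, mapped_embeddings, decoder_text_embeddings,
                mse_weight=1.0):
    # Cross-entropy over the LLM's (audio) token predictions vs. the ground-truth tokens.
    ce = F.cross_entropy(token_logits.transpose(1, 2), target_token_ids)
    # MSE between the output mapping module's conditional embeddings and the
    # music decoder's text-encoder embeddings.
    mse = F.mse_loss(mapped_embeddings, decoder_text_embeddings)
    return ce + mse_weight * mse

loss = stage2_loss(torch.randn(2, 16, 32000), torch.randint(0, 32000, (2, 16)),
                   torch.randn(2, 8, 768), torch.randn(2, 8, 768))
print(loss.item())
```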

Stage 3: Joint encoder-decoder training

This stage freezes the multimodal encoders and the LLM, and trains the multimodal understanding adapters, the output mapping module, and the LoRA parameters in the LLM;

The training data at this stage include Alpaca (general knowledge), MusicQA, MUImage, MUVideo and MUEdit;

To enable the model to generate music and text at the same time, the MUImage, MUVideo, and MUEdit datasets add the special audio tokens to the LLM output during stage 3 training (as in stage 2).

In the future, the research team's work will focus on further improving the model's fine-grained music understanding capabilities, improving the correlation between generated music and input instructions, and making the music editing capabilities more accurate.

Paper address: https://arxiv.org/pdf/2311.11255.pdf

-over-
