Llama can also generate images! HKU and ByteDance launch an open-source autoregressive text-to-image model, with an online demo now available
Contributed by the LlamaGen team
Quantum Bit | Public Account QbitAI
With just an Image Tokenizer, Llama can also generate images, with results surpassing diffusion models.
Researchers from HKU and ByteDance proposed an image generation method based on the autoregressive model Llama.
The model has been open sourced and has received nearly 900 stars on GitHub.
After diffusion models emerged, they displaced autoregressive methods and became the mainstream technical route for image generation.
However, on the ImageNet benchmark, the authors' LlamaGen outperforms diffusion models such as LDM and DiT.
This finding demonstrates that even the most vanilla autoregressive architecture can achieve highly competitive image generation performance.
△ Samples generated by LlamaGen: the first row is class-conditional generation, the second row is text-conditional generation
So how is image generation based on an autoregressive model (here, Llama) actually achieved?
Image generation using autoregressive models
The authors note that the open-source community's impression of autoregressive image generation is mostly stuck at the FID of around 15 that VQ-GAN achieved on the ImageNet benchmark in 2020.
Yet as early as 2021, ViT-VQGAN had already reached an FID of around 3.0, and models such as DALL-E 1 and Parti had shown great potential for text-to-image generation.
However, none of these works were open source, so the research team set out to release an open-source autoregressive image generation model.
For existing state-of-the-art image generation models, the authors summarize three key ingredients behind their success:
- Image compressors/tokenizers
- Scalable image generation models
- High-quality training data
Accordingly, the authors adopt the same CNN architecture as VQ-GAN to convert continuous images into discrete tokens.
Compared with the 2020 VQ-GAN, the authors have a better understanding of the Image Tokenizer:
A good tokenizer needs a larger codebook size and a lower codebook vector dimension; at the same time, better image reconstruction requires more tokens.
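To make the tokenizer idea concrete, here is a minimal sketch of a VQ-style quantization layer: continuous CNN encoder features are snapped to the nearest entry of a learned codebook, and the resulting indices become the discrete image tokens. The codebook size and vector dimension below are illustrative placeholders, not the exact configuration used by LlamaGen.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps continuous encoder features to discrete codebook indices.
    Codebook size / vector dim are illustrative; the article's point is that a larger
    codebook with a lower vector dimension tends to give a better tokenizer."""
    def __init__(self, codebook_size=16384, code_dim=8):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z):
        # z: (B, code_dim, H, W) continuous features from the CNN encoder
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)           # (B*H*W, code_dim)
        # squared L2 distance from each feature vector to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        tokens = dist.argmin(dim=1)                           # discrete image tokens
        z_q = self.codebook(tokens).view(b, h, w, c).permute(0, 3, 1, 2)
        # straight-through estimator so gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, tokens.view(b, h * w)
```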
△ VQ-GAN architecture (shown for reference; not this project)
Architecturally, LlamaGen is largely based on the Llama language model, including pre-normalization with RMSNorm, SwiGLU, and RoPE.
Although some techniques commonly used in image generation (such as AdaLN) might further improve performance, the authors try to keep the architecture identical to that of the Llama language model.
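As a compact illustration of these Llama-style components, the sketch below shows RMSNorm pre-normalization and a SwiGLU feed-forward block (RoPE, which is applied to attention queries and keys, is omitted for brevity). Dimensions are illustrative; this is not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization used in Llama: scale activations by their root mean square."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Llama-style gated feed-forward: silu(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```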
For both the class-conditional and text-conditional image generation models, the authors use the simplest possible implementation:
The class or text embedding serves directly as the starting token, and subsequent image tokens are generated with the next-token prediction paradigm.
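A hedged sketch of that generation loop follows. `model`, its `tok_embed` embedding layer, and the sampling hyperparameters are hypothetical stand-ins rather than the real LlamaGen API; the point is only to show the condition embedding acting as the start token and image tokens being sampled one at a time.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_image_tokens(model, cond_embedding, num_tokens=256,
                          temperature=1.0, top_k=1000):
    """Autoregressive image-token sampling. `model` is a hypothetical Llama-style
    transformer mapping a sequence of embeddings to next-token logits (B, T, vocab)."""
    tokens = []
    # the condition embedding (class or pooled text), shape (B, dim), acts as the start "token"
    seq = cond_embedding.unsqueeze(1)                 # (B, 1, dim)
    for _ in range(num_tokens):                       # e.g. 16x16 = 256 tokens per image
        logits = model(seq)[:, -1, :] / temperature   # logits for the next image token
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)   # (B, 1)
        tokens.append(next_tok)
        # embed the sampled token (hypothetical `tok_embed`) and extend the sequence
        seq = torch.cat([seq, model.tok_embed(next_tok)], dim=1)
    # (B, num_tokens); these discrete tokens are then decoded back to pixels by the tokenizer
    return torch.cat(tokens, dim=1)
```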
The training process is divided into two stages.
In the first stage, the model is trained on a 50M-image subset of LAION-COCO at a resolution of 256×256.
The original LAION-COCO dataset contains 600 million image-text pairs; the authors filtered them by valid image URL, aesthetic score, watermark score, CLIP image-text similarity score, and image size.
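Read as a rule, this filtering is simply a conjunction of per-sample checks. The sketch below is illustrative only: the field names and threshold values are assumptions, not the ones actually used by the authors.

```python
def keep_sample(meta,
                min_aesthetic=5.0,      # assumed thresholds; the authors' actual values may differ
                max_watermark=0.5,
                min_clip_sim=0.28,
                min_size=256):
    """Illustrative LAION-COCO-style filtering: keep an image-text pair only if every check passes.
    `meta` is a hypothetical dict of per-sample metadata fields."""
    return (meta["url_is_valid"]
            and meta["aesthetic_score"] >= min_aesthetic
            and meta["watermark_score"] <= max_watermark
            and meta["clip_similarity"] >= min_clip_sim
            and min(meta["width"], meta["height"]) >= min_size)

# usage: filtered = [m for m in metadata if keep_sample(m)]
```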
In the second stage, the model is fine-tuned on 10 million internal images of high aesthetic quality at a resolution of 512×512.
The text descriptions of these aesthetic images are generated by LLaVA.
At deployment time, an image generation model built on the vanilla autoregressive architecture can seamlessly reuse existing LLM serving frameworks such as vLLM, which is a major advantage of a unified model architecture.
Meanwhile, serving with the vLLM-based framework brought a 326%-414% speedup to LlamaGen.
Results on par with diffusion models
So how well does the model actually perform?
Start with the authors' retrained Image Tokenizer, which outperforms previous tokenizers on ImageNet and COCO, including VQGAN, ViT-VQGAN, and MaskGIT.
Importantly, this tokenizer based on discrete representations performs on par with VAEs based on continuous representations (such as the SD VAE widely used in diffusion models), indicating that discrete quantization of images is no longer a major bottleneck for image reconstruction.
In actual generation, LlamaGen is highly competitive on the ImageNet benchmark across metrics such as FID, IS, Precision, and Recall.
In particular, the LlamaGen-3B model outperforms the popular diffusion models LDM and DiT, showing that the simplest autoregressive architecture can serve as the backbone of an advanced image generation system.
Compared with earlier autoregressive models, LlamaGen also comes out ahead at every parameter scale.
The authors attribute this to the better scalability of the Image Tokenizer and the Llama architecture.
For text-to-image generation, the model already has basic text-image alignment after the first training stage, but the visual quality of the generated images still needs improvement.
The second training stage significantly improved the visual quality of the generated images, an improvement the authors attribute to two factors:
- The second stage used high-quality aesthetic images;
- The image resolution was raised from 256×256 in the first stage to 512×512 in the second, and a larger resolution brings better visual quality.
Even with longer text prompts, LlamaGen can generate images that are both well aligned with the text and visually high quality.
However, the authors also acknowledge that, measured against the development path of diffusion models, the current LlamaGen has only reached the Stable Diffusion v1 stage. Future directions include SDXL (higher resolution, more aspect ratios), ControlNet (stronger controllability), and Sora (video generation).
From a multimodal large-model perspective, autoregressive models have now been shown to handle understanding tasks and generation tasks separately; the next step is to train both jointly within a single model.
The project is now open source and has an online demo; if you are interested, give it a try.
Online experience:
https://huggingface.co/spaces/FoundationVision/LlamaGen
Paper address:
https://arxiv.org/abs/2406.06525
Project homepage:
https://peizesun.github.io/llamagen/
GitHub:
https://github.com/FoundationVision/LlamaGen
Hugging Face:
https://huggingface.co/FoundationVision/LlamaGen