Unified image generation with no need for cumbersome plug-ins! Zhiyuan releases the diffusion model framework OmniGen
Yunzhong sent from Aofei Temple
Quantum Bit | Public Account QbitAI
A multimodal model that unifies image generation.
The latest diffusion model framework is here.
The Beijing Academy of Artificial Intelligence (Zhiyuan, BAAI) has launched OmniGen, which natively supports a wide variety of image generation tasks, has a highly simplified architecture, and effectively transfers knowledge across different tasks to handle unseen tasks and domains.
Features are as follows:
1. Unification: OmniGen natively supports various image generation tasks, such as text-to-image generation, image editing, subject-driven generation, and visual-condition-based generation. In addition, OmniGen can handle classic computer vision tasks by converting them into image generation tasks.
2. Simplicity: OmniGen's architecture is highly simplified. It is also more user-friendly than existing models: complex tasks can be completed through instructions alone, without lengthy preprocessing steps or additional modules (such as ControlNet or IP-Adapter), which greatly simplifies the workflow.
3. Knowledge transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, copes with unseen tasks and domains, and exhibits novel capabilities. We also explore potential applications of the model's reasoning ability and the chain-of-thought mechanism to image generation.
Building on OmniGen's general capabilities, image generation becomes far more flexible. The following shows a simple pipeline: generate an image from text, edit some elements of the generated image, produce a repainted image based on the human pose in the generated image, and finally extract the desired object from another image and merge it into the new image.
AI image generation plug-ins are cumbersome to operate
In recent years, many text-to-image models have emerged in the wave of generative AI. However, these excellent proprietary models can only generate images from text. When users have more flexible, complex, or fine-grained image generation needs, additional plug-ins and operations are usually required.
For example, to generate an image that follows an arbitrary reference pose, the conventional method is to run a pose detector to estimate the pose from the reference image as the conditional input, load the corresponding ControlNet plug-in, and finally extract features from that conditional input and feed them into the diffusion model to generate the image.
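For concreteness, a minimal sketch of this conventional workflow with the diffusers and controlnet_aux libraries might look as follows (the model IDs, prompt, and file paths are illustrative, not from the original article):

```python
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Step 1: a separate pose detector estimates the pose from the reference image.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = openpose(load_image("reference.png"))

# Step 2: load the matching ControlNet plug-in on top of a base diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Step 3: the extracted pose map conditions the diffusion model.
image = pipe("a dancer on the beach", image=pose_map).images[0]
image.save("pose_controlled.png")
```

Note that three separate components (detector, ControlNet, base model) must be chosen, downloaded, and wired together before a single image comes out.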
In addition, if you want to generate a new image based on a specific person in a group photo, the process is even more cumbersome: you first need to crop the image to ensure the result contains only the target person.
Methods such as InstantID additionally require a face detector to extract facial information and a face encoder to extract features to feed into the model.
Notably, different generation tasks require yet more different plug-ins and operations. Such a complex, fragmented, and lengthy workflow greatly increases the cost of training and application. Even so, it can still fall short of general image generation needs, such as generating new images based on entities from multiple specified photos.
By contrast, in text generation, models represented by ChatGPT can directly handle all kinds of text tasks through human instructions. So, in image generation, can a single model that supports multimodal inputs and couples multiple capabilities complete various generation tasks from user instructions, without all these complicated procedures?
To address this challenge, Zhiyuan released the unified image generation model OmniGen. OmniGen is simple and easy to use, and integrates a variety of basic image generation tasks, including but not limited to: text-to-image generation, image editing, identity-consistent generation, and generation based on visual conditions.
OmniGen can complete tasks from arbitrary multimodal text-and-image instructions, without any additional plug-ins or operations.
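Based on the open-source repository, the entire workflow above collapses into a single call; a sketch of the released interface (the prompt and image path are illustrative; input images are referenced in the prompt via <img><|image_1|></img> placeholders):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# One call handles a mixed text-and-image instruction; no pose detector,
# ControlNet, or face encoder is loaded separately.
images = pipe(
    prompt="The woman in <img><|image_1|></img> waves her hand happily in the crowd.",
    input_images=["./group_photo.png"],  # illustrative path
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    seed=0,
)
images[0].save("result.png")
```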
Strong general capabilities
OmniGen integrates multiple capabilities, including but not limited to:
- Text to Image Generation
- Referring Expression Generation
- General Image Conditional Generation
- Image Editing
- Classic computer vision tasks: image denoising, edge detection, pose estimation, etc.
- Certain in-context learning capabilities
The following is a brief summary of some of the capabilities:
2.1 Text to Image Generation
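As the baseline capability, text-to-image generation needs no input image at all; a minimal sketch using the released pipeline (the prompt is illustrative):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Plain text-to-image: no input_images argument is needed.
images = pipe(
    prompt="A curly-haired man in a red shirt is drinking tea.",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("t2i.png")
```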
2.2 Referring Expression Generation
OmniGen can generate identity-consistent images in the way models such as InstantID and PuLID do: given an input image containing a single subject, it understands and follows the instruction and outputs a new image based on that subject.
At the same time, OmniGen has a higher-order ability, referring expression generation, which we define as the ability to identify the object referred to by the instruction within an image containing multiple objects and generate a new image accordingly.
For example, OmniGen can directly locate the target person in a multi-person image according to the instruction and generate a new image that follows the instruction, without any additional modules or operations:
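In code, such a referring instruction might look like the following sketch (prompt and path are illustrative):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# The instruction itself singles out one person from a two-person photo.
images = pipe(
    prompt="A man in a black shirt is reading a book. The man is the right man in <img><|image_1|></img>.",
    input_images=["./two_men.jpg"],  # illustrative path
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    seed=0,
)
```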
2.3 General Image Conditional Generation
OmniGen not only supports generation based on explicit visual conditions, as ControlNet does, but can also itself handle classic computer vision tasks (such as human pose estimation and depth estimation).
OmniGen can therefore complete the entire ControlNet-style process with a single model: use OmniGen directly to extract the visual condition from the original image, then generate an image based on the extracted condition, with no additional processors needed.
OmniGen can also simplify away the intermediate step and produce the image in one pass: directly input the original image with the instruction "Following the human pose (or depth map) of this image, generate a new image: ...", and a new image is generated according to the human pose or depth relationships of the input image.
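A sketch of this one-pass usage (the exact prompt phrasing and path are illustrative):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# One pass: the model reads the pose from the raw photo and generates
# directly; no separate pose-detection step or ControlNet is involved.
images = pipe(
    prompt="Following the human pose of this image <img><|image_1|></img>, generate a new photo: a woman dancing in a forest.",
    input_images=["./source_photo.jpg"],  # illustrative path
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
```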
2.4 Image Editing
OmniGen has good image editing capabilities and can execute multiple editing instructions simultaneously in a single run, for example:
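A sketch of issuing several edits in one instruction (prompt and path are illustrative):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Two edits issued at once in a single instruction.
images = pipe(
    prompt="<img><|image_1|></img> Remove the cup on the table and change the sky to a sunset.",
    input_images=["./room.jpg"],  # illustrative path
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
```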
2.5 More capabilities
OmniGen also has latent reasoning capabilities and can handle implicit query instructions that require a degree of understanding and inference from the model.
For example, when asked to delete the object in the picture that can hold water, the model can understand and infer which object in the picture the instruction refers to, and delete it:
On the other hand, OmniGen has a certain degree of in-context learning ability and can process images based on reference examples. For example, given an input-output example of segmenting the queen chess piece, the model can recognize and segment the corresponding object in a new input image:
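Both behaviors fit the same instruction interface; a sketch (the prompts, placeholder layout, and paths are all illustrative):

```python
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Implicit query: the model must infer which object "can hold water".
images = pipe(
    prompt="<img><|image_1|></img> Delete the object in the image that can hold water.",
    input_images=["./kitchen.jpg"],  # illustrative path
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)

# In-context example: an input/output segmentation pair guides the model
# on a new image.
images = pipe(
    prompt=(
        "Following the example of <img><|image_1|></img> to <img><|image_2|></img>, "
        "segment the queen chess piece in <img><|image_3|></img>."
    ),
    input_images=["./example_in.jpg", "./example_out.jpg", "./new_board.jpg"],
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
```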
The Chain-of-Thought (CoT) approach significantly improves LLM performance by breaking a task into multiple steps and solving them sequentially to reach an accurate final answer. We considered whether something similar could be applied to image generation. Inspired by the way humans paint, we hoped to imitate the step-by-step painting process, iteratively generating an image from a blank canvas. After preliminary exploration and fine-tuning, the model was able to mimic this behavior and generate pictures step by step. Further optimization is left to future work.
OmniGen's capabilities include but are not limited to the above; it also covers basics such as image denoising and edge extraction. The model weights and code have been open-sourced, and users can explore more of OmniGen's capabilities on their own.
Discarding extra modules wherever possible
OmniGen's core design principles are: simplicity and effectiveness.
Therefore, the research team discarded additional modules as far as possible. OmniGen's basic architecture consists of a Transformer model and a VAE module, with 3.8B parameters in total. The Transformer is inherited from the Phi-3-mini model and uses bidirectional attention within each image to match the characteristics of image data. The overall architecture is as follows:
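This attention design (causal over the whole sequence, bidirectional within each image's token span) can be sketched as a simple mask; the span bookkeeping below is our own illustration, not code from the release:

```python
import torch

def omnigen_style_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Causal attention over the full sequence, but full (bidirectional)
    attention within each image's token span. True = may attend."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        mask[start:end, start:end] = True  # tokens of one image see each other
    return mask

# Example: a 10-token sequence whose tokens 4..7 belong to one image.
print(omnigen_style_mask(10, [(4, 8)]))
```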
To achieve strong generality and generalization, the model needs to be trained on large-scale, diverse data. However, no such universal dataset existed for image generation. To this end, we built X2I, the first large-scale, diverse unified image generation dataset; the name means "Anything to Image". In X2I, the data formats of different tasks are reorganized and unified for easy management and use. The dataset contains about 100 million images and will be open-sourced after review, aiming to further advance general image generation. The figure below briefly shows some examples from the X2I dataset:
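To illustrate what a "unified format" could mean in practice, here is a purely hypothetical record layout; the field names are our own illustration, not the released X2I schema:

```python
# Hypothetical unified training record: every task, from editing to pose-
# conditioned generation, reduces to (instruction, input images, target image).
record = {
    "instruction": "Following the human pose of <img><|image_1|></img>, "
                   "generate a new photo: a woman dancing in a forest.",
    "input_images": ["pose_source.jpg"],  # may be empty for text-to-image
    "output_image": "target.jpg",
}
```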
In short, OmniGen's unified image generation paradigm not only supports various downstream tasks but also makes it easy to combine capabilities to meet more general needs. OmniGen's report, weights, and code are now open-sourced, and the community is welcome to join in exploring OmniGen's potential capabilities, improving its basic performance, and broadening its applications.
OmniGen is a preliminary attempt at unified image generation, and there is still much room for improvement. Going forward, Zhiyuan will further improve the model's basic capabilities and extend it with more interesting functions. The fine-tuning code has also been released, so users can fine-tune the model easily. Since OmniGen's input format is highly flexible, users can define a wide variety of fine-tuning tasks themselves, giving the model more interesting capabilities.
Related Links:
Paper: https://arxiv.org/pdf/2409.11340
Code: https://github.com/VectorSpaceLab/OmniGen
Demo: https://huggingface.co/spaces/Shitao/OmniGen
*This article is published by Quantum Bit with authorization, and the views expressed are solely those of the author.
-over-
Quantum bit QbitAI
Tracking new trends in AI technology and products