| # | Method | LLM | Date | Overall | SIPU | MTITU | VPU | Und. Avg | CVIG | FIR | TIE | TIG | TVG | VP | Gen. Avg | IEE | CSQ | AL | SD | VCoT | Unify Avg |
|---|--------|-----|------|---------|------|-------|-----|----------|------|-----|-----|-----|-----|----|----------|-----|-----|----|----|------|-----------|
| | QA pairs | | | 1964 | 1200 | 400 | 364 | 1964 | 600 | 200 | 200 | 200 | 200 | 194 | 1594 | 200 | 100 | 52 | 104 | 90 | 546 |
| 1 | Gemini2.0-flash-exp (Google DeepMind) | - | 2025-03-12 | 45.6 | 72.6 | 68.3 | 54.9 | 65.2 | - | 77.6 | 43.5 | 57.6 | - | - | 29.8 | 38.4 | 74.8 | 47.1 | 26.0 | 12.4 | 40.7 |
| 2 | MIO-Instruct (Beihang University) | MIO-7B | 2024-09-26 | 37.2 | 52.0 | 33.5 | 39.0 | 41.5 | 51.2 | 59.3 | 43.7 | 48.2 | 51.9 | 66.4 | 53.5 | 24.2 | 38.5 | 8.7 | 11.5 | 0.0 | 16.6 |
| 3 | SEED-LLaMA (Tencent AI Lab) | LLaMA2-Chat-13B | 2023-12-18 | 28.5 | 49.2 | 33.0 | 36.3 | 39.5 | - | 57.0 | 42.3 | 42.0 | - | - | 23.5 | 22.0 | 51.5 | 12.5 | 22.0 | 3.6 | 22.3 |
| 4 | Anole (GAIR) | - | 2024-07-08 | 18.6 | 17.2 | 14.5 | 9.0 | 13.6 | - | 36.6 | 43.4 | 41.5 | - | - | 20.0 | 18.6 | 59.7 | 14.4 | 15.0 | 3.9 | 22.3 |
| 5 | VILA-U (Tsinghua University) | LLaMA-7B | 2024-09-06 | 18.6 | 51.0 | 32.3 | 36.5 | 40.0 | - | - | - | 45.1 | 49.6 | - | 15.8 | - | - | - | - | - | - |
| 6 | Janus-Pro (DeepSeek-AI) | DeepSeek-LLM-7B-base | 2025-01-29 | 18.1 | 59.6 | 43.5 | 42.2 | 48.4 | - | - | - | 35.3 | - | - | 5.9 | - | - | - | - | - | - |
| 7 | MiniGPT-5 (University of California) | Vicuna-7B | 2023-10-03 | 16.4 | 19.3 | 10.9 | 15.9 | 15.4 | - | 39.0 | 35.0 | 35.5 | - | - | 18.3 | 22.8 | 34.1 | 14.4 | 5.0 | 2.1 | 15.7 |
| 8 | Janus-Flow (DeepSeek-AI) | DeepSeek-LLM-1.5B-base | 2024-11-12 | 16.3 | 41.5 | 32.0 | 35.2 | 43.4 | - | - | - | 32.9 | - | - | 5.5 | - | - | - | - | - | - |
| 9 | GILL (Carnegie Mellon University) | OPT-6.7B | 2023-03-26 | 15.1 | 22.2 | 6.0 | 3.6 | 10.6 | - | 50.7 | 35.7 | 46.6 | - | - | 22.2 | 24.3 | 21.3 | 8.7 | 6.7 | 1.9 | 12.6 |
| 10 | HermesFlow (Peking University) | Phi-1.5 | 2025-02-17 | 14.0 | 41.5 | 33.0 | 28.3 | 34.3 | - | - | - | 46.5 | - | - | 7.8 | - | - | - | - | - | - |
| 11 | Emu3 (BAAI) | LLaMA-8B | 2024-09-27 | 13.8 | 45.8 | 30.5 | 23.3 | 33.2 | - | - | - | 49.1 | - | - | 8.2 | - | - | - | - | - | - |
| 12 | Show-o (Show Lab) | Phi-1.5 | 2024-08-22 | 12.7 | 32.5 | 34.6 | 25.7 | 31.0 | - | - | - | 43.5 | - | - | 7.3 | - | - | - | - | - | - |
Models are ranked by their average performance across the Understanding, Generation, and Unify tasks, from highest to lowest.
"Avg" indicates the average accuracy across the subtasks in each domain ("Und. Avg", "Gen. Avg", and "Unify Avg" above; see the sketch below).
"-" indicates that the model was unable to finish the corresponding task.
"Green date" indicates the newly added/updated models.