DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Xiaotong Li; Fan Zhang; Haiwen Diao; Yueze Wang; Xinlong Wang; Ling-Yu Duan

arXiv:2407.08303·cs.CV·November 26, 2024·1 cites

TL;DR

DenseFusion-1M introduces a large, high-quality dataset with dense image descriptions generated by a novel caption engine, significantly enhancing multimodal large language models' ability to understand complex visual elements.

Contribution

The paper presents DenseFusion-1M, a new dataset created with a cost-effective caption engine that improves MLLMs' perception of detailed visual information.

Findings

1. Outperforms existing caption engines in generating dense, accurate descriptions.

2. Enhances MLLMs' performance on diverse vision-language benchmarks.

3. Enables better understanding of high-resolution images in multimodal models.

Abstract

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate cutting-edge MLLM research on comprehensive visual perception, we propose Perceptual Fusion, which uses a low-budget but highly effective caption engine to produce complete and accurate image descriptions. Specifically, Perceptual Fusion…

Code & Models

Repositories

baaivision/densefusion
PyTorch · Official

Datasets

BAAI/DenseFusion-1M
dataset · 3.6k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques