DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu, Duan

TL;DR
DenseFusion-1M introduces a large, high-quality dataset with dense image descriptions generated by a novel caption engine, significantly enhancing multimodal large language models' ability to understand complex visual elements.
Contribution
The paper presents DenseFusion-1M, a new dataset created with a cost-effective caption engine that improves MLLMs' perception of detailed visual information.
Findings
Outperforms existing caption engines in generating dense, accurate descriptions.
Enhances MLLMs' performance on diverse vision-language benchmarks.
Enables better understanding of high-resolution images in multimodal models.
Abstract
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and throughout image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate the cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
