DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui

TL;DR
DEEM enhances large multimodal models' image perception by leveraging diffusion models' generative feedback, improving robustness against out-of-distribution data and reducing hallucinations without additional training modules.
Contribution
The paper introduces DEEM, a novel approach that uses diffusion models to improve the semantic alignment of image encoders in large multimodal models, addressing out-of-distribution challenges.
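As a rough illustration of what "generative feedback" from a diffusion model can look like, the sketch below computes a standard noise-prediction (DDPM-style) loss in which the denoiser is conditioned on the image encoder's output. It is not the authors' implementation: the linear `encode` and `denoise` functions, the weight names, and the noise schedule are all hypothetical stand-ins chosen to keep the example small. In a real setup, an autodiff framework would propagate this loss back into the image encoder; here only the loss value is computed.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def encode(image, W_enc):
    """Hypothetical linear image encoder: flatten the image, then project it."""
    return image.reshape(-1) @ W_enc


def denoise(x_t, t_frac, cond, W_x, W_c):
    """Hypothetical toy denoiser: predicts the noise from the noisy image,
    the normalized timestep, and the encoder features used as the condition."""
    return x_t @ W_x + cond @ W_c + t_frac


def generative_feedback_loss(image, W_enc, W_x, W_c, alpha_bar, t, rng):
    """Noise-prediction MSE with the denoiser conditioned on encoder features."""
    x0 = image.reshape(-1)
    eps = rng.standard_normal(x0.shape)            # noise the denoiser must recover
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # forward diffusion at step t
    cond = encode(image, W_enc)                     # features from the image encoder
    eps_hat = denoise(x_t, t / len(alpha_bar), cond, W_x, W_c)
    return float(np.mean((eps - eps_hat) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((4, 4))             # toy 4x4 "image"
    W_enc = rng.standard_normal((16, 8)) * 0.1
    W_x = rng.standard_normal((16, 16)) * 0.1
    W_c = rng.standard_normal((8, 16)) * 0.1
    alpha_bar = make_alpha_bar(10)
    print(generative_feedback_loss(image, W_enc, W_x, W_c, alpha_bar, 3, rng))
```

The point of the sketch is the data flow: the encoder's features are the only image-derived signal the denoiser receives besides the noisy input, so a lower reconstruction loss requires those features to retain the details needed to regenerate the image.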
Findings
DEEM achieves up to 12.8% improvement on the POPE benchmark.
DEEM reduces visual hallucinations and enhances perception robustness.
Fewer training parameters and less data are needed compared to state-of-the-art methods.
Abstract
The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data: for example, they can hardly distinguish orientation, quantity, color, structure, etc. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard details deemed irrelevant. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Balanced Selection · ALIGN · Diffusion
