TL;DR
LIRA enhances large multi-modal models' segmentation and comprehension by integrating semantic features and local supervision, significantly reducing errors and hallucinations in visual understanding.
Contribution
The paper introduces LIRA, a novel framework combining semantic-enhanced feature extraction and local visual coupling to improve segmentation accuracy and reduce hallucinations in multi-modal models.
Findings
LIRA achieves state-of-the-art segmentation performance.
LIRA significantly reduces hallucinated comprehension errors.
The Attributes Evaluation dataset quantifies semantic inference ability.
Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore,…
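The abstract mentions extracting local features based on segmentation masks. One common way to do this in general is masked average pooling; the sketch below illustrates that generic technique only and is not taken from the LIRA paper or its code. The function name `masked_average_pool`, the array shapes, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def masked_average_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    features: (H, W, C) array of per-pixel features (hypothetical layout).
    mask:     (H, W) boolean array, True where the object is.
    Returns a (C,) vector; all zeros if the mask selects no pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        # An empty mask has no pixels to average, so return a zero vector.
        return np.zeros(features.shape[-1], dtype=features.dtype)
    # Boolean indexing yields an (N, C) array of the selected pixels.
    return features[mask].mean(axis=0)


# Toy 2x2 feature map with 3 channels; the mask picks the two diagonal pixels.
features = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
mask = np.array([[True, False], [False, True]])
local_feature = masked_average_pool(features, mask)
```

In a pipeline of the kind the abstract describes, a pooled vector like `local_feature` would stand in for the region-level input from which a local description is generated; how LIRA actually builds and consumes these features is not specified in the text above.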
