LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Zhang Li; Biao Yang; Qiang Liu; Shuo Zhang; Zhiyin Ma; Liang Yin; Linger Deng; Yabo Sun; Yuliang Liu; Xiang Bai

arXiv:2507.06272 · cs.CV · August 12, 2025

TL;DR

LIRA enhances large multi-modal models' segmentation and comprehension by integrating semantic features and local supervision, significantly reducing errors and hallucinations in visual understanding.

Contribution

The paper introduces LIRA, a novel framework combining semantic-enhanced feature extraction and local visual coupling to improve segmentation accuracy and reduce hallucinations in multi-modal models.

Findings

01 LIRA achieves state-of-the-art segmentation performance.

02 LIRA significantly reduces hallucinated comprehension errors.

03 The Attributes Evaluation dataset quantifies semantic inference ability.

Abstract

While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore,…
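The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how SEFE fuses features or how ILVC extracts local features, so the concatenation-based fusion and the masked average pooling below are assumptions chosen for simplicity, and the names `fuse_features` and `masked_average_pool` are hypothetical. Feature maps are plain nested lists so the example runs without extra dependencies.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W grid of C-dimensional feature vectors
Mask = List[List[int]]           # H x W binary mask, 1 = inside the region


def fuse_features(semantic: FeatureMap, pixel: FeatureMap) -> FeatureMap:
    """Fuse two spatially aligned feature maps by per-location concatenation.

    Assumption: concatenation stands in for whatever fusion SEFE actually uses.
    """
    return [
        [s_vec + p_vec for s_vec, p_vec in zip(s_row, p_row)]
        for s_row, p_row in zip(semantic, pixel)
    ]


def masked_average_pool(features: FeatureMap, mask: Mask) -> Vector:
    """Average the feature vectors at locations where the mask is 1.

    Assumption: average pooling stands in for the mask-based local feature
    extraction that precedes local description generation in ILVC.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c, value in enumerate(vec):
                    total[c] += value
    if count == 0:
        return total  # empty mask: return an all-zero vector
    return [t / count for t in total]


# Toy 2 x 2 example with one semantic channel and one pixel-level channel.
semantic = [[[1.0], [2.0]], [[3.0], [4.0]]]
pixel = [[[10.0], [20.0]], [[30.0], [40.0]]]
mask = [[1, 0], [0, 1]]  # region covers the top-left and bottom-right cells

fused = fuse_features(semantic, pixel)
local_feature = masked_average_pool(fused, mask)  # [2.5, 25.0]
```

In a model of this kind, a pooled vector such as `local_feature` would presumably be passed to the language model as the condition for generating the region's description; that step is outside the scope of this sketch.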

Code & Models

Models

🤗 echo840/LIRA (model · 4 dl · ♡ 1)
