Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation

You Huang; Lichao Chen; Jiayi Ji; Liujuan Cao; Shengchuan Zhang; Rongrong Ji

arXiv:2507.09612 · cs.CV · July 15, 2025

Open Access

TL;DR

Inter2Former introduces a dynamic hybrid attention framework that optimizes dense-token processing for interactive segmentation, achieving high accuracy and efficiency on CPU devices by adaptively focusing computation on regions of interest.

Contribution

The paper presents an adaptive computation framework for efficient high-precision interactive segmentation, built from four components: Dynamic Prompt Embedding (DPE), Dynamic Hybrid Attention (DHA), Mixture of Experts, and Local Upsampling.

Findings

1. Achieves state-of-the-art performance on high-precision IS benchmarks.
2. Operates efficiently on CPU devices with high segmentation quality.
3. Utilizes adaptive attention and computation strategies to focus on relevant regions.

Abstract

Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous…
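The region-of-interest idea that the abstract attributes to DPE can be illustrated with a small sketch. This is not the paper's implementation: the helper names `roi_from_clicks` and `embed_roi_only`, the bounding-box definition of the region of interest, and the `margin` parameter are all assumptions made for illustration. The sketch only shows the general principle of applying an expensive per-token function to tokens near the user's clicks while leaving background tokens untouched.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Box = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), end-exclusive


def roi_from_clicks(clicks: Sequence[Tuple[int, int]],
                    grid_h: int, grid_w: int, margin: int = 1) -> Box:
    """Hypothetical helper: bounding box around click positions on the
    token grid, padded by `margin` tokens and clipped to the grid."""
    rows = [r for r, _ in clicks]
    cols = [c for _, c in clicks]
    r0 = max(min(rows) - margin, 0)
    r1 = min(max(rows) + margin + 1, grid_h)
    c0 = max(min(cols) - margin, 0)
    c1 = min(max(cols) + margin + 1, grid_w)
    return (r0, r1, c0, c1)


def embed_roi_only(tokens: List[List[Token]], box: Box,
                   embed: Callable[[Token], Token]) -> Tuple[List[List[Token]], int]:
    """Apply `embed` only to tokens inside `box`; background tokens are
    copied unchanged. Returns the new grid and the number of embed calls."""
    r0, r1, c0, c1 = box
    out = [[list(tok) for tok in row] for row in tokens]
    calls = 0
    for r in range(r0, r1):
        for c in range(c0, c1):
            out[r][c] = embed(tokens[r][c])
            calls += 1
    return out, calls


# Toy usage: an 8x8 grid of 2-dimensional tokens and a single click at (3, 4).
grid = [[[0.0, 0.0] for _ in range(8)] for _ in range(8)]
box = roi_from_clicks([(3, 4)], grid_h=8, grid_w=8, margin=1)
updated, calls = embed_roi_only(grid, box, lambda t: [x + 1.0 for x in t])
print(box, calls)  # (2, 5, 3, 6) 9
```

In this toy case the embedding function runs on 9 of the 64 tokens, so the cost scales with the size of the region of interest rather than with the full token grid.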

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization