Interactive Image Segmentation with Cross-Modality Vision Transformers
Kun Li, George Vosselman, Michael Ying Yang

TL;DR
This paper introduces a cross-modality vision transformer for interactive image segmentation that leverages mutual information between modalities, achieving superior performance and stability over previous models.
Contribution
It proposes a novel cross-modality transformer architecture that effectively models relations between different data modalities for interactive segmentation.
Findings
Outperforms previous state-of-the-art models on several benchmarks.
Demonstrates high stability and reduced failure cases.
Shows potential as a practical annotation tool.
Abstract
Interactive image segmentation aims to segment the target from the background under manual guidance, taking as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to the interactive segmentation task. However, previous works neglect the relations between the two modalities and directly mimic the way purely visual information is processed with self-attention. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploit mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in…
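
To make the contrast with plain self-attention concrete, the sketch below shows generic scaled dot-product cross-attention in which image patch tokens query a small set of click tokens. It is an illustration only, not the authors' architecture: the token counts, the embedding size `dim`, the random projection matrices, and the function name `cross_attention` are all assumptions introduced here, and the abstract does not specify how clicks are embedded or in which direction attention is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Hypothetical helper: tokens of one modality attend to another modality.

    queries: (n_q, dim) tokens that get updated (here: image patches).
    context: (n_c, dim) tokens that are attended to (here: clicks).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product scores, one row per query token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Assumed toy sizes: an 8x8 grid of patch tokens and 3 click tokens.
rng = np.random.default_rng(0)
dim = 16
image_tokens = rng.normal(size=(64, dim))
click_tokens = rng.normal(size=(3, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Each image token is rewritten as a weighted mix of click-token values.
image_updated, attn = cross_attention(image_tokens, click_tokens, w_q, w_k, w_v)
```

In self-attention, `queries` and `context` would be the same token set; passing tokens from two different modalities is what lets one modality condition the other.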
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
