Interactive Image Segmentation with Cross-Modality Vision Transformers

Kun Li; George Vosselman; Michael Ying Yang

arXiv:2307.02280 · cs.CV · July 6, 2023 · 1 citation

TL;DR

This paper introduces a cross-modality vision transformer for interactive image segmentation that leverages mutual information between modalities, achieving superior performance and stability over previous models.

Contribution

It proposes a novel cross-modality transformer architecture that effectively models relations between different data modalities for interactive segmentation.

Findings

01 Outperforms previous state-of-the-art models on several benchmarks.

02 Demonstrates high stability and reduced failure cases.

03 Shows potential as a practical annotation tool.

Abstract

Interactive image segmentation aims to segment the target from the background with manual guidance, taking as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to the interactive segmentation task. However, previous works neglect the relations between the two modalities and directly mimic the way purely visual information is processed with self-attention. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploit mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in…
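The abstract contrasts plain self-attention over visual tokens with attention that relates the image and click modalities. A minimal sketch of generic scaled dot-product cross-attention is given below, assuming image tokens act as queries and click tokens supply keys and values. The token counts, feature width, random projection matrices, and the function name `cross_attention` are illustrative assumptions, not details taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic cross-attention: `queries` attend to `context`.

    queries: (N, d) tokens from one modality (here: image patches, assumed).
    context: (M, d) tokens from the other modality (here: clicks, assumed).
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, M) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 16 image tokens, 3 click tokens, feature width 8.
rng = np.random.default_rng(0)
d = 8
image_tokens = rng.standard_normal((16, d))
click_tokens = rng.standard_normal((3, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

updated, weights = cross_attention(image_tokens, click_tokens, w_q, w_k, w_v)
```

Swapping the two arguments, so that click tokens query the image tokens, gives the opposite direction of information flow. Whether the proposed network uses one direction or both is not stated in the text above.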

Code & Models

Repositories

lik1996/icmformer (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection