TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut
Yangtao Wang (M-PSI), Xi Shen, Yuan Yuan (MIT CSAIL), Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley (M-PSI), Dominique Vaufreydaz (M-PSI)

TL;DR
TokenCut builds a fully connected graph over image or video patches using features from a self-supervised transformer, then segments salient objects with Normalized Cut. It reaches state-of-the-art results on several detection and segmentation benchmarks without supervision.
Contribution
The paper formulates unsupervised object detection and segmentation as a graph-cut problem over patch features from a self-supervised transformer, solves it with the classical Normalized Cut algorithm, and reports gains over competing approaches.
Findings
Outperforms competing methods for unsupervised object discovery on the VOC07, VOC12, and COCO20K datasets.
Improves IoU scores for unsupervised saliency detection on the ECSSD, DUTS, and DUT-OMRON datasets.
Achieves competitive results in unsupervised video object segmentation.
Abstract
In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6%, respectively, when tested with the VOC07, VOC12, and…
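The pipeline described in the abstract (patch features, a fully connected similarity graph, and a Normalized Cut bipartition) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `normalized_cut_bipartition`, the cosine-similarity edge weights, the threshold `tau`, the floor weight `eps`, and the choice to split the eigenvector at its mean are all assumptions made for the example, and the patch features are taken as given rather than extracted from a transformer.

```python
import numpy as np

def normalized_cut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N patch feature vectors into two groups via a relaxed Normalized Cut.

    features: (N, D) array with one row per patch (assumed precomputed).
    tau, eps: hypothetical similarity threshold and floor edge weight.
    Returns a boolean mask of length N and the second generalized eigenvector.
    """
    # Cosine similarity between every pair of patches (fully connected graph).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    # Edge weights: 1 for similar pairs, a small eps otherwise so the
    # graph stays connected (an assumption of this sketch).
    W = np.where(sim > tau, 1.0, eps)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order. The eigenvector of the
    # second smallest eigenvalue, rescaled by D^{-1/2}, solves the relaxed
    # Normalized Cut problem (D - W) x = lambda * D x.
    _, vecs = np.linalg.eigh(L_sym)
    second = d_inv_sqrt * vecs[:, 1]

    # Bipartition by thresholding the eigenvector at its mean value.
    return second > second.mean(), second
```

Calling `normalized_cut_bipartition(feats)` on an `(N, D)` array returns a boolean mask that assigns each patch to one of two groups. Which side corresponds to the salient object is not decided here, because the sign of an eigenvector is arbitrary, so a real system needs an additional rule to pick the foreground.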
