VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement
Hanjung Kim, Jaehyun Kang, Miran Heo, Sukjun Hwang, Seoung Wug Oh, Seon Joo Kim

TL;DR
VISAGE introduces appearance-guided enhancements to improve object association in video instance segmentation, addressing the over-reliance on location cues and achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel extension to object decoders that explicitly captures appearance features, significantly improving tracking accuracy in VIS tasks.
Findings
Achieves state-of-the-art results on YouTube-VIS 2019/2021 and OVIS datasets.
Constructs a synthetic dataset to evaluate appearance awareness.
Enhances object association by explicitly modeling appearance features.
Abstract
In recent years, online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors. Utilizing the frame-level output queries of the detector, these methods achieve high accuracy on challenging benchmarks. However, our observations demonstrate that these methods rely heavily on location information, which often causes incorrect associations between objects. This paper shows that appearance information is a key axis of object matching in trackers, and that it becomes especially informative when positional cues are insufficient for distinguishing object identities. We therefore propose a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture the appearances of objects, which greatly enhances instance association accuracy.…
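The idea of appearance-guided association can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how appearance and query cues are combined, so the weighted sum, the `alpha` parameter, the greedy matching, and the names `cosine` and `associate` are all assumptions made for illustration. Each object is represented by a hypothetical pair of vectors, a query embedding (assumed here to be dominated by location) and an appearance embedding.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def associate(prev, curr, alpha=0.5):
    """Greedily match objects across two frames.

    prev, curr: lists of (query_embedding, appearance_embedding) pairs.
    alpha: hypothetical weight on appearance similarity (0 = query only,
           1 = appearance only). Returns sorted (prev_idx, curr_idx) pairs.
    """
    scores = []
    for i, (prev_query, prev_app) in enumerate(prev):
        for j, (curr_query, curr_app) in enumerate(curr):
            score = ((1 - alpha) * cosine(prev_query, curr_query)
                     + alpha * cosine(prev_app, curr_app))
            scores.append((score, i, j))

    # Take the highest-scoring pairs first, using each object at most once.
    scores.sort(key=lambda t: -t[0])
    used_prev, used_curr = set(), set()
    matches = []
    for _, i, j in scores:
        if i in used_prev or j in used_curr:
            continue
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j))
    return sorted(matches)


# Toy case: two objects swap positions between frames, so their
# location-like query embeddings swap while their appearance stays the same.
prev_frame = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
curr_frame = [([0.0, 1.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

location_only = associate(prev_frame, curr_frame, alpha=0.0)
appearance_guided = associate(prev_frame, curr_frame, alpha=0.8)
```

In this toy case, matching on the query embeddings alone swaps the two identities, while weighting appearance more heavily keeps them. A real tracker would typically use learned embeddings and an optimal assignment step rather than greedy matching.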
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
