TL;DR
This paper introduces MTNet, an efficient unsupervised video object segmentation method that combines motion and appearance cues with a temporal transformer to improve accuracy and robustness across challenging scenarios.
Contribution
The paper proposes a novel framework, MTNet, that integrates motion, appearance, and long-range temporal modeling within a unified architecture for improved UVOS performance.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Effectively combines motion and appearance features.
Demonstrates robustness in diverse challenging scenarios.
Abstract
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects within a unified framework. MTNet merges appearance and motion features during feature extraction within the encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating effective inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the…
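The two ideas named in the abstract, fusing appearance with motion features and letting frames interact through attention, can be illustrated with a minimal sketch. This is not MTNet's implementation: the element-wise sum in `fuse`, the single-head attention without learned projections in `temporal_attention`, and the toy feature vectors are all assumptions made for illustration, and both function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(appearance, motion):
    """Assumed fusion rule: element-wise sum of one frame's two feature vectors."""
    return [a + m for a, m in zip(appearance, motion)]


def temporal_attention(frames):
    """Scaled dot-product self-attention across the frames of a clip.

    frames: list of T feature vectors, each of length D.
    Queries, keys, and values are the raw features (no learned projections,
    an assumption made to keep the sketch short).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    mixed = []
    for query in frames:
        # Similarity of this frame to every frame in the clip.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        # Each output feature is a weighted average over all frames.
        mixed.append(
            [sum(w * value[j] for w, value in zip(weights, frames)) for j in range(dim)]
        )
    return mixed


# Toy clip with three frames and two-dimensional features (made-up values).
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]

fused = [fuse(a, m) for a, m in zip(appearance, motion)]
mixed = temporal_attention(fused)
```

Because the attention weights for each frame sum to one, every output vector is a convex combination of the fused per-frame features, which is how information from distant frames can reach the current one.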
