Tracking and Segmenting Anything in Any Modality
Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

TL;DR
This paper introduces SATA, a universal framework for tracking and segmentation across any modality, addressing cross-modal and cross-task challenges to improve generalization in video understanding.
Contribution
SATA unifies multiple tracking and segmentation tasks across inputs of any modality, using a Decoupled Mixture-of-Expert mechanism and a Task-aware Multi-object Tracking pipeline to improve flexibility and generalization.
Findings
Reports superior performance on 18 tracking and segmentation benchmarks.
Effectively handles multiple modalities and tasks.
Improves cross-task and cross-modal knowledge sharing.
Abstract
Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and…
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications
