ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking
Jiawei Ge, Xintian Zhang, Jiuxin Cao, Bo Liu, Fabian Deuser, Chang Liu, Gong Wenkang, Siyou Li, Juexi Shao, Wenqing Wu, Chen Feng, Ioannis Patras

TL;DR
ViewSAM introduces a weakly supervised framework for cross-view referring multi-object tracking. It uses foundation models as pseudo-label generators and explicitly models view-aware semantics, achieving state-of-the-art results under weak supervision.
Contribution
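The paper proposes a two-stage weakly supervised framework for cross-view referring multi-object tracking (CRMOT). It uses foundation models for pseudo-labeling and models view-aware semantics, reducing reliance on costly frame-level spatial annotations and cross-view identity supervision.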
Findings
ViewSAM achieves state-of-the-art performance under weak supervision.
Foundation models can effectively generate pseudo labels for CRMOT.
ViewSAM remains competitive with fully supervised methods.
Abstract
Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised…
