ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking

Jiawei Ge; Xintian Zhang; Jiuxin Cao; Bo Liu; Fabian Deuser; Chang Liu; Gong Wenkang; Siyou Li; Juexi Shao; Wenqing Wu; Chen Feng; Ioannis Patras

arXiv:2605.02638 · cs.CV · May 5, 2026

TL;DR

ViewSAM introduces a weakly supervised framework for cross-view referring multi-object tracking, leveraging foundation models as pseudo-label generators and explicitly modeling view-aware semantics to achieve state-of-the-art results.

Contribution

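The paper proposes a two-stage weakly supervised framework for cross-view referring multi-object tracking (CRMOT). It uses foundation models as pseudo-label generators and models view-aware semantics, reducing reliance on costly frame-level spatial annotations and cross-view identity supervision.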

Findings

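01. ViewSAM achieves state-of-the-art performance under weak supervision.

02. Foundation models such as SAM2 and SAM3 struggle to resolve referring expressions and keep identities consistent across views when applied directly, but they produce reliable object tracklets that can serve as pseudo labels for CRMOT.

03. ViewSAM remains competitive with fully supervised methods.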

Abstract

Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised…
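To make "globally consistent identities" concrete, here is a minimal sketch of the output the task description implies: per-view tracklets relabeled so that the same object carries one identity in every camera view. It illustrates the task's data shape only, not the paper's method. The names `Box`, `to_global`, and the `assignment` mapping are hypothetical, and the cross-view assignment is assumed to be given.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """One detection: frame index plus an (x, y, w, h) bounding box."""
    frame: int
    x: float
    y: float
    w: float
    h: float


# Hypothetical layout: tracklets[view][local_id] -> boxes of that track.
Tracklets = Dict[str, Dict[int, List[Box]]]


def to_global(
    tracklets: Tracklets,
    assignment: Dict[Tuple[str, int], int],
) -> Dict[int, Dict[str, List[Box]]]:
    """Relabel per-view local track IDs with global identities.

    `assignment` maps (view, local_id) -> global_id and is assumed given.
    Tracks with no assignment are treated as not referred and dropped.
    """
    out: Dict[int, Dict[str, List[Box]]] = {}
    for view, tracks in tracklets.items():
        for local_id, boxes in tracks.items():
            gid = assignment.get((view, local_id))
            if gid is None:
                continue  # not referred by the expression
            per_view = out.setdefault(gid, {})
            if view in per_view:
                # Two tracks in one view cannot be the same object.
                raise ValueError(f"global id {gid} appears twice in {view}")
            per_view[view] = boxes
    return out


# Toy example: one object seen in two views, plus an unreferred track.
tracklets = {
    "cam_a": {0: [Box(0, 10, 10, 5, 5)], 1: [Box(0, 50, 50, 5, 5)]},
    "cam_b": {5: [Box(0, 30, 12, 5, 5)]},
}
assignment = {("cam_a", 0): 7, ("cam_b", 5): 7}
result = to_global(tracklets, assignment)
```

In this toy case `result` holds a single global identity, 7, with one tracklet from each of `cam_a` and `cam_b`; local track 1 in `cam_a` has no assignment and is left out.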
