Training-Free Semantic Multi-Object Tracking with Vision-Language Models
Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero

TL;DR
This paper introduces TF-SMOT, a training-free pipeline for semantic multi-object tracking that leverages pretrained models for detection, segmentation, and language generation to produce human-interpretable scene descriptions.
Contribution
TF-SMOT combines existing pretrained components into a novel, training-free framework for semantic multi-object tracking, enabling rapid adaptation and improved performance without additional training.
Findings
Achieves state-of-the-art tracking performance on the BenSMOT dataset.
Improves video summary and caption quality over prior methods.
Interaction recognition remains challenging with fine-grained labels.
Abstract
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with…
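The gloss-based retrieval step mentioned above can be illustrated with a minimal sketch. The abstract is cut off before it names the model used for retrieval, so this example substitutes a plain bag-of-words cosine similarity as a stand-in; it is not the authors' method. The label names and glosses in `GLOSSES` are invented for illustration and are not the BenSMOT label set, and `retrieve_label` is a hypothetical helper.

```python
import math
import re
from collections import Counter

# Hypothetical synset-style labels with made-up glosses (NOT the BenSMOT labels).
GLOSSES = {
    "hug.v.01": "squeeze someone tightly in your arms, usually with fondness",
    "chase.v.01": "go after with the intent to catch",
    "talk.v.01": "exchange thoughts or speak with another person",
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_label(predicate, glosses=GLOSSES):
    """Return the label whose gloss is most similar to the free-text predicate.

    Stand-in scoring: token overlap instead of a learned text embedding.
    """
    query = bag_of_words(predicate)
    scores = {label: cosine(query, bag_of_words(g)) for label, g in glosses.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    label, scores = retrieve_label("speak with another person")
    print(label, scores)
```

Matching against glosses rather than bare label names gives the retriever more text to compare with, which is the general idea behind mapping free-form predicates onto a fixed label vocabulary. A real system would likely replace the token-overlap score with sentence embeddings.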
