Training-Free Semantic Multi-Object Tracking with Vision-Language Models

Laurence Bonat; Francesco Tonini; Elisa Ricci; Lorenzo Vaquero

arXiv:2604.14074·cs.CV·April 16, 2026

TL;DR

This paper introduces TF-SMOT, a training-free pipeline for semantic multi-object tracking that leverages pretrained models for detection, segmentation, and language generation to produce human-interpretable scene descriptions.

Contribution

TF-SMOT combines existing pretrained components into a novel, training-free framework for semantic multi-object tracking, enabling rapid adaptation and improved performance without additional training.

Findings

01 Achieves state-of-the-art tracking performance on the BenSMOT dataset.

02 Improves video summary and caption quality over prior methods.

03 Interaction recognition remains challenging when labels are fine-grained.

Abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with…
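The gloss-based retrieval step mentioned at the end of the abstract can be illustrated with a small sketch. The abstract is cut off before it names the retrieval model, so the code below substitutes a plain bag-of-words cosine similarity purely as a stand-in; it is not the paper's method. The label IDs, gloss strings, and function names (`TOY_GLOSSES`, `bag_of_words`, `cosine`, `retrieve_label`) are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy label set: synset-style IDs mapped to made-up glosses.
# These are illustrative strings, not real BenSMOT labels or WordNet glosses.
TOY_GLOSSES = {
    "hug.v.01": "squeeze someone tightly in your arms",
    "chase.v.01": "run after someone in order to catch them",
    "talk.v.01": "exchange words with someone in conversation",
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_label(predicate, glosses):
    """Return the label whose gloss is most similar to the predicate text.

    Assumption: similarity is bag-of-words cosine, a stand-in for whatever
    text encoder a real pipeline would use.
    """
    query = bag_of_words(predicate)
    scores = {
        label: cosine(query, bag_of_words(gloss))
        for label, gloss in glosses.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # A free-form predicate, as a captioning model might phrase it.
    print(retrieve_label("run after someone to catch them", TOY_GLOSSES))
```

Matching against glosses rather than bare label names gives the retriever more text to compare, which is the general idea behind aligning free-form predicates to a fixed label vocabulary. A practical system would likely replace the token-count vectors with learned sentence embeddings.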
