TL;DR
STORM is an end-to-end multimodal large language model (MLLM) for referring multi-object tracking that performs grounding and tracking jointly. It uses a task-composition learning strategy and a new dataset, STORM-Bench, to improve performance and data efficiency.
Contribution
The paper introduces STORM, a unified model for referring multi-object tracking, and STORM-Bench, a new dataset with accurate trajectories and diverse referring expressions.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Demonstrates strong generalization and robust spatial-temporal reasoning.
Leverages a task-composition learning strategy for improved data efficiency.
Abstract
Referring multi-object tracking (RMOT) is the task of associating all the objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial-temporal reasoning. We further construct STORM-Bench, a new RMOT dataset…
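To make the task decomposition in the abstract concrete, here is a minimal toy sketch of RMOT viewed as image grounding followed by object tracking. It is not STORM's method: STORM is an end-to-end MLLM, while this sketch swaps in keyword matching for grounding and greedy IoU association for tracking. All names (`Box`, `ground`, `track`, `rmot`) and the 0.3 threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    label: str  # toy stand-in for an object's appearance/semantics


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def ground(frame: list[Box], query: str) -> list[Box]:
    """Sub-task 1 (image grounding), toy version: keep boxes whose label is a query word."""
    words = query.split()
    return [b for b in frame if b.label in words]


def track(prev: dict[int, Box], frame: list[Box], thresh: float = 0.3) -> dict[int, Box]:
    """Sub-task 2 (object tracking), toy version: greedy IoU association to the next frame."""
    matched: dict[int, Box] = {}
    used: set[int] = set()
    for tid, pb in prev.items():
        best, best_iou = None, thresh
        for j, cb in enumerate(frame):
            if j in used:
                continue
            score = iou(pb, cb)
            if score > best_iou:
                best, best_iou = j, score
        if best is not None:
            used.add(best)
            matched[tid] = frame[best]
    return matched


def rmot(video: list[list[Box]], query: str) -> dict[int, list[Box]]:
    """Compose the two sub-tasks: ground in the first frame, then track forward."""
    tracks = dict(enumerate(ground(video[0], query)))
    trajectories = {tid: [b] for tid, b in tracks.items()}
    for frame in video[1:]:
        tracks = track(tracks, frame)
        for tid, b in tracks.items():
            trajectories[tid].append(b)
    return trajectories


video = [
    [Box(0, 0, 10, 10, "car"), Box(20, 0, 30, 10, "person")],
    [Box(1, 0, 11, 10, "car"), Box(20, 1, 30, 11, "person")],
]
result = rmot(video, "the car")
```

In this toy, only objects matching the query in the first frame are followed, and a track simply stops when no box in the next frame overlaps it enough. A real RMOT system also has to handle objects that enter later and expressions that depend on motion.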
