STORM: End-to-End Referring Multi-Object Tracking in Videos

Zijia Lu; Jingru Yi; Jue Wang; Yuxiao Chen; Junwen Chen; Xinyu Li; Davide Modolo

arXiv:2604.10527 · cs.CV · April 14, 2026

TL;DR

STORM is an end-to-end multi-object tracking framework that jointly performs grounding and tracking, leveraging a new dataset and a task-composition learning strategy to improve performance and data efficiency.

Contribution

The paper introduces STORM, a unified model for referring multi-object tracking, and STORM-Bench, a new dataset with accurate trajectories and diverse referring expressions.

Findings

1. Achieves state-of-the-art results on multiple benchmarks.

2. Demonstrates strong generalization and robust spatial-temporal reasoning.

3. Leverages a task-composition learning strategy for improved data efficiency.

Abstract

Referring multi-object tracking (RMOT) is the task of associating all objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial-temporal reasoning. We further construct STORM-Bench, a new RMOT dataset…
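To make the two sub-tasks named in the abstract concrete, the toy sketch below composes a grounding step with a tracking step. It is not STORM's method: STORM is described as an end-to-end MLLM without external detectors, whereas this sketch assumes pre-annotated boxes with captions, matches the query by simple word lookup, and associates boxes by greedy IoU. The names `ground`, `track`, `refer_and_track`, the `caption` field, and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ground(frame: List[dict], query: str) -> List[Box]:
    """Sub-task 1 (toy image grounding): keep boxes whose caption has every query word."""
    words = query.lower().split()
    return [
        obj["box"]
        for obj in frame
        if all(w in obj["caption"].lower().split() for w in words)
    ]


def track(prev: Dict[int, Box], candidates: List[Box], thresh: float = 0.3) -> Dict[int, Box]:
    """Sub-task 2 (toy tracking): greedily match each track to its best-IoU candidate."""
    matched: Dict[int, Box] = {}
    used = set()
    for track_id, prev_box in prev.items():
        best_j, best_score = -1, thresh
        for j, cand in enumerate(candidates):
            score = iou(prev_box, cand)
            if j not in used and score > best_score:
                best_j, best_score = j, score
        if best_j >= 0:  # tracks with no candidate above the threshold are dropped
            matched[track_id] = candidates[best_j]
            used.add(best_j)
    return matched


def refer_and_track(video: List[List[dict]], query: str) -> List[Dict[int, Box]]:
    """Composition: ground the query in the first frame, then track through the rest."""
    tracks = {i: box for i, box in enumerate(ground(video[0], query))}
    history = [dict(tracks)]
    for frame in video[1:]:
        tracks = track(tracks, [obj["box"] for obj in frame])
        history.append(dict(tracks))
    return history
```

Calling `refer_and_track(video, "red car")` on a list of frames, where each frame is a list of `{"box": ..., "caption": ...}` dictionaries, returns one `{track_id: box}` mapping per frame. The composition only grounds in the first frame, so objects that enter later are missed; that limitation is a property of this simplified sketch, not a claim about STORM.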

Code & Models

Repositories

amazon-science/storm-referring-multi-object-grounding (GitHub)
