STORM: Segment, Track, and Object Re-Localization from a Single Image
Yu Deng, Teng Cao, Hikaru Shindo, Quentin Delfosse, Jiahong Xue, Kristian Kersting

TL;DR
STORM is a unified framework for 6D object tracking from a single reference image; it uses hierarchical attention and a tracking verifier to improve robustness and reduce manual effort.
Contribution
It introduces Hierarchical Spatial Fusion Attention (HSFA) for flexible reference-query fusion and a BCE-trained verifier for drift detection, improving pose tracking from minimal input.
Findings
Outperforms strong baselines on the LM-O and YCB-Video datasets.
Recovers reliably from occlusions and rapid viewpoint changes.
Operates annotation-free and with minimal manual input.
Abstract
Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect…
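The verifier idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only assumes that a verifier emits one compatibility logit per frame, that the verifier is trained with binary cross-entropy on that logit, and that the negated logit serves as the energy-like score. The function names, the `threshold` and `patience` parameters, and the rule of flagging drift after several consecutive high-energy frames are all hypothetical choices made for illustration.

```python
import math


def bce_with_logits(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit.

    Uses the identity max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def energy(logit: float) -> float:
    """Energy-like score (assumed here to be the negated logit).

    Low energy means the reference and the tracked crop look compatible.
    """
    return -logit


def detect_drift(logits, threshold=0.0, patience=2):
    """Return the index of the first frame at which the energy has exceeded
    `threshold` for `patience` consecutive frames, or None if that never
    happens. Both parameters are hypothetical, not values from the paper.
    """
    streak = 0
    for i, z in enumerate(logits):
        streak = streak + 1 if energy(z) > threshold else 0
        if streak >= patience:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical per-frame logits: confident at first, then degrading.
    frame_logits = [3.0, 2.5, -0.5, -1.2, -2.0]
    print(detect_drift(frame_logits))
```

In a tracking loop, a non-None result would be the point at which a system might stop trusting the current pose and trigger re-localization. Requiring several consecutive high-energy frames is one simple way to avoid reacting to a single noisy score.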
