ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Sijia Chen; Zihan Zhou; Yanqiu Yu; En Yu; Wenbing Tao

arXiv:2603.05384 · cs.CV · March 6, 2026

TL;DR

This paper introduces ORMOT, a new task for tracking objects in omnidirectional videos based on language descriptions, together with a dataset for it. Both address the limited field of view of datasets captured by conventional cameras.

Contribution

The work defines the ORMOT task, which extends RMOT to omnidirectional imagery; constructs the ORSet dataset (27 scenes, 848 language descriptions, 3,401 objects); and proposes ORTrack, a framework built on large vision-language models.

Findings

01 ORTrack outperforms baseline methods on ORSet.

02 The dataset contains 27 scenes, 848 descriptions, and 3,401 objects.

03 Omnidirectional imagery improves long-horizon language understanding.

Abstract

Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from a limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV)…
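The fragmentation problem described in the abstract can be illustrated with a toy calculation. The sketch below is not from the paper and is not part of ORTrack. It assumes a hypothetical target that circles a fixed camera at 10° of azimuth per frame, and a hypothetical helper, visible_segments, that reports the contiguous frame ranges during which the target lies inside a given horizontal field of view. The 90° and 360° values are illustrative assumptions, not figures from the source.

```python
def visible_segments(azimuths_deg, fov_deg, center_deg=0.0):
    """Return (start, end) frame-index pairs of contiguous visibility.

    Hypothetical helper for illustration only: a target counts as visible
    when its azimuth lies within fov_deg / 2 of the camera's optical axis.
    """
    half = fov_deg / 2.0
    segments = []
    start = None
    for i, az in enumerate(azimuths_deg):
        # Signed angular offset from the optical axis, wrapped to [-180, 180).
        offset = (az - center_deg + 180.0) % 360.0 - 180.0
        visible = abs(offset) <= half
        if visible and start is None:
            start = i                        # a new visible segment begins
        elif not visible and start is not None:
            segments.append((start, i - 1))  # target left the frame
            start = None
    if start is not None:
        segments.append((start, len(azimuths_deg) - 1))
    return segments


# Assumed trajectory: two full circles around the camera, 10 degrees per frame.
azimuths = [(10.0 * t) % 360.0 for t in range(72)]

narrow = visible_segments(azimuths, fov_deg=90.0)   # conventional camera (assumed FoV)
omni = visible_segments(azimuths, fov_deg=360.0)    # omnidirectional camera

print("90-degree FoV segments:", narrow)
print("360-degree FoV segments:", omni)
```

Under these assumptions the conventional camera sees the target in three separate segments, each of which a tracker would have to re-associate, while the omnidirectional camera sees one unbroken segment covering all 72 frames.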

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Advanced Neural Network Applications