OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Teng Fu; Mengyang Zhao; Ke Niu; Kaixin Peng; Bin Li

arXiv:2511.17053·cs.CV·November 24, 2025

TL;DR

This paper introduces OmniPT, a unified framework leveraging large vision-language models for pedestrian tracking that combines tracking, referencing, and semantic understanding, showing improved performance over prior methods.

Contribution

The paper presents a novel Pedestrian Tracking framework, OmniPT, integrating tracking, referencing, and semantic understanding using LVLMs with a specialized training pipeline.

Findings

01. Outperforms previous pedestrian tracking methods on benchmark datasets.

02. Enables LVLMs to output formatted bounding box and tracking information.

03. Demonstrates improved semantic understanding of tracked objects.
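
To make the second finding concrete, the following is a minimal sketch of how downstream code might consume box and identity information emitted as text by a model. The line format (`id=3 box=[x1,y1,x2,y2]`), the `TrackedBox` type, and the `parse_tracking_text` helper are all assumptions made for illustration; this summary does not state the format OmniPT actually uses.

```python
import re
from dataclasses import dataclass

# Hypothetical output format, NOT taken from the paper: one object per line,
# e.g. "id=3 box=[120,45,210,300]" with pixel coordinates x1,y1,x2,y2.
LINE_RE = re.compile(r"id=(\d+)\s+box=\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


@dataclass(frozen=True)
class TrackedBox:
    track_id: int
    x1: int
    y1: int
    x2: int
    y2: int


def parse_tracking_text(text: str) -> list[TrackedBox]:
    """Extract well-formed (id, box) records and skip anything malformed."""
    boxes = []
    for match in LINE_RE.finditer(text):
        track_id, x1, y1, x2, y2 = map(int, match.groups())
        if x2 > x1 and y2 > y1:  # discard degenerate boxes
            boxes.append(TrackedBox(track_id, x1, y1, x2, y2))
    return boxes
```

Validating the generated text before use matters in a setup like this, since a language model can emit malformed or degenerate coordinates; the sketch simply drops such records rather than attempting to repair them.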

Abstract

LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, a number of new topics have emerged that combine object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on a reference, and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Neural Network Applications