Improving Token-based Object Detection with Video

Abhineet Singh; Nilanjan Ray

arXiv:2506.22562 · cs.CV · August 21, 2025

TL;DR

This paper extends the Pix2Seq object detector to videos, representing objects as variable-length token sequences and 3D tracklets, leading to improved detection and tracking performance without complex heuristics.

Contribution

It introduces a novel end-to-end video object detection method using token sequences and 3D tracklets, simplifying the process and enhancing scalability.

Findings

01. Consistent improvement over the baseline Pix2Seq detector.

02. Competitive performance on the UA-DETRAC dataset.

03. Scalability with longer video subsequences.

Abstract

This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes…
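To make the token-sequence idea concrete, here is a minimal sketch of how a set of video objects could be serialized as discrete tokens, with each object emitted as one tracklet (one box per frame) followed by a class token. This is an illustration under stated assumptions, not the authors' exact format: the vocabulary layout, the bin count `N_BINS`, the `NA_TOKEN` used for frames where an object is absent, and the helper names `quantize`, `tracklet_to_tokens`, and `video_to_sequence` are all hypothetical.

```python
from typing import List, Optional, Tuple

# (ymin, xmin, ymax, xmax), each normalized to [0, 1]
Box = Tuple[float, float, float, float]

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
N_BINS = 1000              # coordinate tokens occupy ids 0 .. N_BINS - 1
NA_TOKEN = N_BINS          # object not present in this frame
EOS_TOKEN = N_BINS + 1     # end of the whole sequence
CLASS_OFFSET = N_BINS + 2  # class c maps to token CLASS_OFFSET + c


def quantize(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalized coordinate to one of n_bins discrete tokens."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * n_bins), n_bins - 1)


def tracklet_to_tokens(boxes: List[Optional[Box]], class_id: int) -> List[int]:
    """Serialize one tracklet: 4 tokens per frame, then a class token."""
    tokens: List[int] = []
    for box in boxes:
        if box is None:
            # Object absent in this frame: fill its slot with NA tokens.
            tokens.extend([NA_TOKEN] * 4)
        else:
            tokens.extend(quantize(c) for c in box)
    tokens.append(CLASS_OFFSET + class_id)
    return tokens


def video_to_sequence(
    tracklets: List[Tuple[List[Optional[Box]], int]]
) -> List[int]:
    """Concatenate all tracklets; sequence length grows with object count."""
    sequence: List[int] = []
    for boxes, class_id in tracklets:
        sequence.extend(tracklet_to_tokens(boxes, class_id))
    sequence.append(EOS_TOKEN)
    return sequence
```

In this sketch a two-frame clip containing two objects yields 2 × (2 × 4 + 1) + 1 = 19 tokens, while an empty clip yields only the end-of-sequence token, which is how a variable-length sequence accommodates any number of objects without a fixed set of candidate boxes.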
