Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

Chenmin Yu; Liu Yu; Daiqing Wu; Gengluo Li; Zeyu Chen; Yu Zhou

arXiv:2605.17270·cs.CV·May 19, 2026

TL;DR

This paper introduces SymTrack, a novel structure-aware framework for scene text tracking in videos, addressing geometric distortions and visual ambiguities, and establishes new state-of-the-art results on multiple benchmarks.

Contribution

The paper presents the first systematic approach to scene text tracking: a dual-branch design with Cross-Expert Calibration and Predictive Token Rectification mechanisms. It also builds a new benchmark from existing datasets.

Findings

01. SymTrack outperforms previous trackers by up to 11.97% AUC.

02. It effectively handles geometric distortions and visual ambiguities.

03. The framework achieves state-of-the-art results on three benchmark datasets.
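
For readers unfamiliar with the AUC figure in the first finding, the sketch below shows one common way such a score is computed in visual object tracking: the area under the success plot, taken as the mean success rate over evenly spaced overlap thresholds. This is an assumption about the metric, not a description of the paper's evaluation protocol. The axis-aligned (x, y, width, height) box format, the function names `iou` and `success_auc`, and the choice of 21 thresholds are all illustrative; scene text is often annotated with quadrilaterals or polygons, which would need a polygon IoU instead.

```python
from typing import Sequence, Tuple

# Hypothetical box format for illustration: (x, y, width, height).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred: Sequence[Box], gt: Sequence[Box],
                num_thresholds: int = 21) -> float:
    """Mean success rate over thresholds evenly spaced in [0, 1].

    A frame counts as a success at threshold t when its IoU exceeds t.
    Averaging the success rates approximates the area under the
    success plot (an assumed, commonly used definition).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and equal in length")
    if num_thresholds < 2:
        raise ValueError("num_thresholds must be at least 2")
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy two-frame sequence: one exact match, one half-shifted box.
    predictions = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0)]
    ground_truth = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(f"success AUC: {success_auc(predictions, ground_truth):.4f}")
```

Under this definition, a gain reported in AUC reflects better overlap across the whole range of thresholds rather than at a single cutoff.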

Abstract

Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with a synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to…
