LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model
Hongen Liu, Di Sun, Jiahao Wang, Yi Liu, Gang Pan

TL;DR
The paper introduces LOGO, a novel video text spotting framework that combines language collaboration and glyph perception to improve detection, recognition, and tracking of text in videos, especially under challenging conditions.
Contribution
LOGO integrates a language synergy classifier and glyph supervision into existing text spotters, enhancing low-resolution text detection and recognition accuracy without extensive fine-tuning.
Findings
Improves detection and recognition of low-resolution text instances.
Effectively filters out text-like background regions.
Achieves state-of-the-art performance on public benchmarks.
Abstract
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, recent methods directly track the zero-shot results of state-of-the-art image text spotters and achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajectories on extreme datasets. Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO, an innovative framework designed to enhance the performance of conventional text spotters. To achieve this goal, we design a language synergy classifier (LSC) to explicitly discern text instances from…
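The abstract mentions methods that track the per-frame outputs of an image text spotter. As a rough illustration of that general tracking-by-detection idea, the sketch below links axis-aligned boxes across frames by greedy IoU matching. It is not LOGO's tracker or any cited method: the function names `iou` and `link_frames`, the box format `(x1, y1, x2, y2)`, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2), axis-aligned.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_frames(frames: List[List[Box]], iou_thresh: float = 0.5) -> List[List[int]]:
    """Assign a track id to every box in every frame.

    Each box is matched greedily (highest IoU first) to a box from the
    previous frame; unmatched boxes start a new track.
    """
    next_id = 0
    prev: List[Tuple[int, Box]] = []  # (track id, box) pairs from the last frame
    ids_per_frame: List[List[int]] = []
    for boxes in frames:
        # All (score, previous index, current index) candidates, best first.
        pairs = sorted(
            (
                (iou(prev_box, box), pi, bi)
                for pi, (_, prev_box) in enumerate(prev)
                for bi, box in enumerate(boxes)
            ),
            reverse=True,
        )
        ids = [-1] * len(boxes)
        used_prev = set()
        for score, pi, bi in pairs:
            if score < iou_thresh:
                break  # remaining candidates overlap too little
            if pi in used_prev or ids[bi] != -1:
                continue  # one of the two boxes is already matched
            ids[bi] = prev[pi][0]
            used_prev.add(pi)
        # Boxes without a match open new tracks.
        for bi in range(len(boxes)):
            if ids[bi] == -1:
                ids[bi] = next_id
                next_id += 1
        prev = list(zip(ids, boxes))
        ids_per_frame.append(ids)
    return ids_per_frame


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (100, 100, 110, 110)],
]
track_ids = link_frames(frames)
```

In this toy input the first box moves by one pixel and keeps its track id, the second box disappears, and a new box far away opens a new track. A tracker of this kind depends entirely on the quality of the per-frame detections, which is the dependency the abstract points to when it discusses domain gap.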
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Multimodal Machine Learning Applications
