Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan

TL;DR
Q-Mask introduces a causal, query-driven mask decoder for improved text anchoring in OCR-based vision-language models, supported by a large-scale dataset and benchmark.
Contribution
It presents a novel visual CoT paradigm with a causal mask decoder and a large annotated dataset to enhance text-region grounding in VLMs.
Findings
Q-Mask significantly improves text anchoring accuracy.
The TextAnchor-Bench reveals current models' limitations in stable text grounding.
Q-Mask enhances understanding in diverse visual scenes.
Abstract
Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
