MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin; Xudong Xie; Zhang Li; Xiang Bai; Yuliang Liu

arXiv:2506.10609 · cs.CV · December 23, 2025

TL;DR

MSTAR introduces a box-free, multi-query scene text retrieval method utilizing attention recycling and progressive vision embedding, achieving superior performance without costly bounding box annotations and establishing a new benchmark dataset.

Contribution

The paper presents MSTAR, a novel box-free approach for multi-query scene text retrieval that unifies diverse query types and introduces the MQTR dataset for comprehensive evaluation.

Findings

01

MSTAR outperforms previous models by 6.4% in mean average precision (MAP) on Total-Text.

02

MSTAR surpasses prior models by 8.5% on the MQTR benchmark.

03

MSTAR eliminates the need for bounding box annotations when training scene text retrieval models.
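
For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of how mean average precision is commonly computed for a retrieval task. It is not taken from the MSTAR code; the function names and the toy image IDs are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """AP for one query: precision@k summed over the ranks k that hold a
    relevant image, divided by the total number of relevant images."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
) -> float:
    """MAP: the mean of per-query AP over all queries in `rankings`."""
    if not rankings:
        return 0.0
    aps = [
        average_precision(ranked, ground_truth.get(query, set()))
        for query, ranked in rankings.items()
    ]
    return sum(aps) / len(aps)


# Hypothetical toy data: two text queries, each with a ranked list of image IDs.
rankings = {
    "coffee": ["img_a", "img_b", "img_c"],
    "exit": ["img_b", "img_a"],
}
ground_truth = {
    "coffee": {"img_a", "img_c"},
    "exit": {"img_a"},
}
score = mean_average_precision(rankings, ground_truth)
```

Under this definition, a reported gain in MAP reflects relevant images being ranked higher on average across all queries.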

Abstract

Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text…

Code & Models

Repositories

yingift/mstar (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques

Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate