MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

TL;DR
MSTAR is a box-free, multi-query scene text retrieval method that uses attention recycling and progressive vision embedding. It outperforms prior methods without requiring costly bounding box annotations, and the paper also introduces the MQTR benchmark dataset.
Contribution
The paper presents MSTAR, a novel box-free approach for multi-query scene text retrieval that unifies diverse query types and introduces the MQTR dataset for comprehensive evaluation.
Findings
MSTAR outperforms previous models by 6.4% MAP on Total-Text.
MSTAR surpasses prior models by 8.5% on the MQTR benchmark.
MSTAR eliminates the need for bounding box annotations when training scene text retrieval models.
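The gains above are reported in MAP (mean average precision). As a reminder of what that metric measures, here is a minimal sketch. It assumes binary relevance labels and that each ranking covers the whole image gallery, so every relevant image appears somewhere in the list. The function names are illustrative and do not come from the paper or its code.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one query.

    relevance[i] is 1 if the image at rank i+1 is relevant to the query,
    else 0. Assumes the ranking covers the full gallery, so the number of
    hits in the list equals the total number of relevant images.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query average precision over all queries."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Two hypothetical text queries with ranked relevance labels.
    example = [[1, 0, 1], [0, 1]]
    print(f"MAP = {mean_average_precision(example):.4f}")
```

Under this definition, a query whose relevant images all sit at the top of the ranking scores 1.0, and each relevant image pushed further down lowers the score.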
Abstract
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate
