Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

Gengluo Li; Huawen Shen; Yu Zhou

arXiv:2506.04999·cs.CV·June 6, 2025

TL;DR

This paper introduces a new benchmark and a novel model, CSTR-CLIP, for Chinese scene text retrieval across diverse layouts, significantly improving accuracy and speed over previous methods.

Contribution

It establishes the DL-CSVTR benchmark for Chinese text retrieval and proposes CSTR-CLIP, a model that effectively handles diverse text layouts with a two-stage training process.

Findings

01. CSTR-CLIP outperforms previous models by 18.82% in accuracy.

02. CSTR-CLIP achieves faster inference speed than previous models.

03. DL-CSVTR evaluates retrieval across diverse Chinese text layouts, including vertical, cross-line, and partial alignments.

Abstract

Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to…
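To make the retrieval task concrete, here is a minimal sketch of how a CLIP-style system could rank gallery images against a query text: both are mapped to vectors in a shared space and images are sorted by cosine similarity. This is a generic illustration, not the CSTR-CLIP implementation. The function names `cosine` and `rank_images`, the image ids, and the toy embedding values are all hypothetical; a real system would obtain the vectors from trained text and image encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by descending similarity to the query."""
    scores = {img_id: cosine(query_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (hypothetical values standing in for encoder outputs).
query_vec = [1.0, 0.0, 0.0]
image_vecs = {
    "img_a": [0.9, 0.1, 0.0],  # closely aligned with the query
    "img_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "img_c": [0.5, 0.5, 0.0],  # partially aligned
}

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

Images that contain the query text should ideally appear at the top of the returned list; how well that holds across vertical, cross-line, and partial layouts is what a benchmark such as DL-CSVTR is designed to measure.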

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques

Methods: Contrastive Language-Image Pre-training