Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text

Xuyang Chen; Dong Wang; Konrad Schindler; Mingwei Sun; Yongliang Wang; Nicolo Savioli; Liqiu Meng

arXiv:2309.11248 · cs.CV · August 13, 2025

Open Access

TL;DR

Box2Poly introduces a memory-efficient, cascade-based polygon prediction method for detecting irregularly shaped and rotated text. Compared with existing Transformer-based approaches, it lowers memory use and inference time at the cost of only a minor accuracy drop.

Contribution

The paper proposes a novel cascade decoding pipeline inspired by Sparse R-CNN for polygon prediction, significantly reducing memory usage and inference time while preserving detection quality.

Findings

01. Over 50% lower memory usage than DPText-DETR

02. More than 40% faster inference than DPText-DETR

03. Only a minor performance drop on standard benchmarks

Abstract

Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising…
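The iterative refinement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bounding_box`, `refine_polygon`, `cascade_decode`) and the exact update rule are assumptions made for illustration. The sketch only shows one plausible way a cascade could take both the location and the scale of the preceding result into account, by treating each stage's output as normalized vertex offsets that are scaled by the previous polygon's bounding-box size and added to the previous vertex positions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def refine_polygon(polygon: Sequence[Point], deltas: Sequence[Point]) -> List[Point]:
    """One hypothetical cascade stage.

    `deltas` are normalized per-vertex offsets (assumed to come from a
    regression head). They are scaled by the width and height of the
    previous polygon's bounding box (scale) and added to the previous
    vertex positions (location).
    """
    x_min, y_min, x_max, y_max = bounding_box(polygon)
    width, height = x_max - x_min, y_max - y_min
    return [
        (x + dx * width, y + dy * height)
        for (x, y), (dx, dy) in zip(polygon, deltas)
    ]


def cascade_decode(
    initial: Sequence[Point], stage_deltas: Sequence[Sequence[Point]]
) -> List[Point]:
    """Apply the refinement stages in order, each starting from the last result."""
    polygon = list(initial)
    for deltas in stage_deltas:
        polygon = refine_polygon(polygon, deltas)
    return polygon


if __name__ == "__main__":
    # Start from an axis-aligned box and refine it over two stages.
    box = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0), (0.0, 4.0)]
    stages = [
        [(0.1, 0.0), (0.0, 0.0), (0.0, 0.5), (0.0, 0.0)],
        [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.5)],
    ]
    print(cascade_decode(box, stages))
```

Because each stage expresses its offsets relative to the size of the preceding prediction, the same normalized offset moves a vertex further on a large text instance than on a small one, which is one way to keep the regression targets in a comparable range across instances.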

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings