Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text
Xuyang Chen, Dong Wang, Konrad Schindler, Mingwei Sun, Yongliang Wang, Nicolo Savioli, Liqiu Meng

TL;DR
Box2Poly introduces a memory-efficient, cascade-based polygon prediction method for detecting irregularly shaped and rotated text. It reduces memory use and inference time relative to existing Transformer-based approaches while largely maintaining accuracy.
Contribution
The paper proposes a novel cascade decoding pipeline inspired by Sparse R-CNN for polygon prediction, significantly reducing memory usage and inference time while preserving detection quality.
Findings
Over 50% memory reduction compared to DPText-DETR
More than 40% faster inference than DPText-DETR
Minor drop in detection quality on standard benchmarks
Abstract
Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising…
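The iterative refinement described in the abstract, where each stage takes the scale and location of the preceding result into account, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`bounding_box`, `refine_polygon`, `cascade`), the choice to normalize vertex offsets by the previous polygon's bounding-box width and height, and the hand-picked offsets are all assumptions made for illustration. In a real detector the offsets would be regressed by a learned head at each stage.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def bounding_box(polygon: List[Point]) -> Box:
    """Axis-aligned box enclosing the polygon (its location and scale)."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)
    return (min(xs) + w / 2, min(ys) + h / 2, w, h)


def refine_polygon(polygon: List[Point], offsets: List[Point]) -> List[Point]:
    """One hypothetical refinement stage.

    Offsets are assumed to be normalized, so each is scaled by the width and
    height of the previous stage's box before being added to its vertex.
    """
    _, _, w, h = bounding_box(polygon)
    return [(x + dx * w, y + dy * h) for (x, y), (dx, dy) in zip(polygon, offsets)]


def cascade(initial: List[Point], stage_offsets: List[List[Point]]) -> List[Point]:
    """Apply the refinement stages in sequence, each starting from the last result."""
    polygon = initial
    for offsets in stage_offsets:
        polygon = refine_polygon(polygon, offsets)
    return polygon


# Hand-picked offsets stand in for what a learned regression head would predict.
initial = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
stages = [
    [(0.25, 0.0)] * 4,  # stage 1: shift right by 0.25 * width
    [(0.0, 0.5)] * 4,   # stage 2: shift down by 0.5 * height
]
refined = cascade(initial, stages)
```

Because the offsets are expressed relative to the previous box, the same normalized value produces a larger absolute correction for a large text instance than for a small one, which is one plausible reading of how conditioning on scale keeps the regression stable.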
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
