Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

Boqiang Zhang; Hongtao Xie; Zuan Gao; Yuxin Wang

arXiv:2405.04377·cs.CV·May 8, 2024

TL;DR

This paper introduces DARLING, a framework that disentangles style and content features in scene text images, improving performance on recognition, removal, and editing tasks.

Contribution

The paper presents the first approach to disentangle style and content features in scene text images, enhancing adaptability for various downstream tasks.

Findings

1. Achieves state-of-the-art results in scene text recognition.
2. Improves scene text removal and editing performance.
3. Effectively decouples style and content features in images.

Abstract

Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) that disentangles these two types of features so that it adapts better to various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the design of the supervision. Concretely, we directly split the visual representation into style and content features; the content features are supervised by a text recognition loss, while an alignment loss aligns the…
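
The supervision described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the feature vectors, the split point `style_dim`, the helper names, and the four-character alphabet are all hypothetical. Because the abstract is truncated at the alignment loss, the mean-squared-error term applied to the style halves of a same-style image pair is an assumption, not a detail confirmed by the source.

```python
import math
import random

def split_features(features, style_dim):
    """Split one feature vector into (style, content) halves by index."""
    return features[:style_dim], features[style_dim:]

def cross_entropy(logits, target):
    """Cross-entropy for one character; a stand-in for a recognition loss."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical encoder outputs for an image pair: same style, different text.
rng = random.Random(0)
feat_a = [rng.uniform(-1, 1) for _ in range(8)]
feat_b = [rng.uniform(-1, 1) for _ in range(8)]

style_a, content_a = split_features(feat_a, style_dim=4)
style_b, content_b = split_features(feat_b, style_dim=4)

# Content half: used directly as logits over a toy 4-character alphabet.
rec_loss = cross_entropy(content_a, target=2) + cross_entropy(content_b, target=0)

# Style half: ASSUMED alignment term pulling the pair's style features together.
align_loss = mse(style_a, style_b)

total_loss = rec_loss + align_loss
print(f"recognition={rec_loss:.3f} alignment={align_loss:.3f}")
```

In a real model the content half would pass through a text decoder before the recognition loss is computed; the sketch skips that step to keep the split and the two loss terms visible.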

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Digital Media Forensic Detection