Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
Xiang Bai, Mingkun Yang, Pengyuan Lyu, Yongchao Xu, Jiebo Luo

TL;DR
This paper proposes a deep learning framework that combines words recognized by a scene text reading system with deep visual features to improve fine-grained image classification and retrieval, with experiments on the Con-Text and Drink Bottle datasets showing significant performance gains.
Contribution
It introduces a method that integrates word embeddings of recognized scene text with deep visual features in a single, globally trainable CNN for image classification and retrieval.
Findings
Significant accuracy improvement over visual-only methods.
Effective use of an attention mechanism to relate each recognized word to the image.
Enhanced retrieval performance in product search applications.
Abstract
Text in natural images contains rich semantics that are often highly relevant to the objects or scene. In this paper, we focus on the problem of fully exploiting scene text for visual understanding. The main idea is to combine word representations and deep visual features in a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Then, we combine the word embeddings of the recognized words and the deep visual features into a single representation, which is optimized by a convolutional neural network for fine-grained image classification. In our framework, an attention mechanism is adopted to reveal the relevance between each recognized word and the given image, which further enhances the recognition performance. We have performed experiments on two datasets: the Con-Text dataset and the Drink Bottle dataset, that are proposed…
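The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each recognized word already has an embedding and the image already has a feature vector. The bilinear scoring function, the `proj` matrix, and the name `attend_and_fuse` are hypothetical choices made for illustration, not the formulation used in the paper. Each word is scored against the image, the scores are normalized with a softmax, and the attention-weighted text vector is concatenated with the visual feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attend_and_fuse(visual, word_embeddings, proj):
    """Weight words by relevance to the image, then concatenate.

    visual:          image feature, list of Dv floats
    word_embeddings: one list of Dw floats per recognized word
    proj:            Dw x Dv matrix (hypothetical; learned in a real model)
    Returns (fused_feature, attention_weights).
    """
    dw = len(proj)
    if not word_embeddings:
        # No text was recognized: fall back to a zero text vector.
        return visual + [0.0] * dw, []

    # Assumed bilinear relevance score: score_i = w_i . (proj @ visual)
    projected = [dot(row, visual) for row in proj]
    scores = [dot(w, projected) for w in word_embeddings]
    weights = softmax(scores)

    # Attention-weighted sum of the word embeddings.
    text = [
        sum(a * w[d] for a, w in zip(weights, word_embeddings))
        for d in range(dw)
    ]
    # A single representation: visual feature followed by text feature.
    return visual + text, weights


# Toy inputs: a 2-d image feature and two 2-d word embeddings.
visual = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = attend_and_fuse(visual, words, identity)
```

In this toy example the first word aligns with the image feature, so it receives the larger attention weight. In a trainable network the fused vector would feed a classification layer, and the projection would be learned jointly with the visual backbone.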
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
