DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning

Dongsheng Xu; Qingbao Huang; Xingmao Zhang; Haonan Cheng; Feng Shuang; Yi Cai

arXiv:2302.01540 · cs.CV · April 29, 2025 · 1 citation

Open Access

TL;DR

This paper introduces DEVICE, a transformer model that incorporates depth and semantic visual concepts to improve OCR-based image captioning, resulting in more accurate and comprehensive scene descriptions.

Contribution

The paper proposes a novel transformer architecture that integrates depth information and semantic visual concepts for enhanced scene understanding in OCR-based captioning.

Findings

01 DEVICE outperforms state-of-the-art models on the TextCaps dataset

02 Incorporating depth improves scene text relational reasoning

03 Semantic-guided alignment enhances visual concept utilization

Abstract

OCR-based image captioning is an important but under-explored task that aims to generate descriptions containing both visual objects and scene text. Recent studies have made encouraging progress, but they still lack an overall understanding of scenes and generate inaccurate captions. One possible reason is that current studies mainly focus on constructing the plane-level geometric relationship of scene text without depth information. This leads to insufficient relational reasoning over scene text, so models may describe it inaccurately. The other possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, so the scene text belonging to those objects goes unused. To address the above issues, we propose a Depth and Visual Concepts Aware…
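
To make the "plane-level geometric relationship" limitation concrete, here is a minimal illustrative sketch. It is not the paper's method. The `OcrToken` class, the two distance functions, and the sample values are all hypothetical, and it assumes each OCR token carries a relative depth value in [0, 1] (larger meaning farther away), for example from a monocular depth estimator. It shows how two tokens that look adjacent in the image plane can be far apart once depth is taken into account.

```python
from dataclasses import dataclass
from math import hypot, sqrt


@dataclass
class OcrToken:
    """Hypothetical OCR token; coordinates are normalised to [0, 1]."""
    text: str
    cx: float     # box centre, x
    cy: float     # box centre, y
    depth: float  # assumed relative depth, larger = farther (not from the source)


def plane_distance(a: OcrToken, b: OcrToken) -> float:
    """Distance using only 2D box centres (plane-level geometry)."""
    return hypot(a.cx - b.cx, a.cy - b.cy)


def depth_aware_distance(a: OcrToken, b: OcrToken) -> float:
    """Distance that also accounts for the assumed depth value."""
    dx, dy, dz = a.cx - b.cx, a.cy - b.cy, a.depth - b.depth
    return sqrt(dx * dx + dy * dy + dz * dz)


# Two tokens that nearly overlap in the image but sit at different depths,
# e.g. a nearby sign in front of a distant billboard.
sign = OcrToken("STOP", cx=0.50, cy=0.40, depth=0.20)
billboard = OcrToken("SALE", cx=0.52, cy=0.42, depth=0.90)

print(f"plane-level distance: {plane_distance(sign, billboard):.3f}")
print(f"depth-aware distance: {depth_aware_distance(sign, billboard):.3f}")
```

With only plane-level geometry the two tokens appear to be neighbours, which could lead a captioner to merge them into one phrase. The depth term separates them.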

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Handwritten Text Recognition Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing