DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
Dongsheng Xu, Qingbao Huang, Xingmao Zhang, Haonan Cheng, Feng Shuang, Yi Cai

TL;DR
This paper introduces DEVICE, a transformer model that incorporates depth and semantic visual concepts to improve OCR-based image captioning, resulting in more accurate and comprehensive scene descriptions.
Contribution
The paper proposes a novel transformer architecture that integrates depth information and semantic visual concepts for enhanced scene understanding in OCR-based captioning.
Findings
DEVICE outperforms state-of-the-art models on the TextCaps dataset
Incorporating depth improves scene text relational reasoning
Semantic-guided alignment enhances visual concept utilization
Abstract
OCR-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text. Recent studies have made encouraging progress, but they are still suffering from a lack of overall understanding of scenes and generating inaccurate captions. One possible reason is that current studies mainly focus on constructing the plane-level geometric relationship of scene text without depth information. This leads to insufficient scene text relational reasoning so that models may describe scene text inaccurately. The other possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, leading to the scene text belonging to these ignored objects not being utilized. To address the above issues, we propose a Depth and Visual Concepts Aware…
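The contrast the abstract draws between plane-level geometry and depth-aware geometry can be illustrated with a toy example. This is only a minimal sketch and not DEVICE's actual mechanism, which the truncated abstract does not describe. The `OcrToken` class, the normalized `depth` field (assumed to come from some monocular depth estimator), the Euclidean distances, and the sample values are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class OcrToken:
    """Hypothetical OCR token: box center plus an assumed depth value."""
    text: str
    x: float      # normalized box-center x in [0, 1]
    y: float      # normalized box-center y in [0, 1]
    depth: float  # assumed normalized depth in [0, 1] (0 = near, 1 = far)


def plane_distance(a: OcrToken, b: OcrToken) -> float:
    """Distance using only image-plane (2D) coordinates."""
    return math.hypot(a.x - b.x, a.y - b.y)


def depth_aware_distance(a: OcrToken, b: OcrToken) -> float:
    """Distance that also accounts for the assumed depth coordinate."""
    return math.sqrt(
        (a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.depth - b.depth) ** 2
    )


def nearest(query: OcrToken, candidates: list, distance) -> OcrToken:
    """Return the candidate closest to the query under the given distance."""
    return min(candidates, key=lambda c: distance(query, c))


# Made-up scene: "SALE" overlaps "STOP" in the image plane but lies far
# behind it, while "ONLY" sits on nearly the same depth plane as "STOP".
sign = OcrToken("STOP", 0.50, 0.40, 0.20)
poster = OcrToken("SALE", 0.55, 0.42, 0.90)
plate = OcrToken("ONLY", 0.50, 0.48, 0.22)

plane_neighbor = nearest(sign, [poster, plate], plane_distance)
depth_neighbor = nearest(sign, [poster, plate], depth_aware_distance)

print("2D neighbor of STOP:", plane_neighbor.text)
print("Depth-aware neighbor of STOP:", depth_neighbor.text)
```

In this made-up scene the two distance functions pick different neighbors for the same token, which is the kind of ambiguity that purely plane-level relations cannot resolve.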
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Handwritten Text Recognition Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing
