Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Nhi Ngoc-Yen Nguyen; Anh-Duc Nguyen; Nghia Hieu Nguyen; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen

arXiv:2604.27712 · cs.CV · May 1, 2026

TL;DR

This paper introduces a linguistically informed multimodal fusion framework for Vietnamese scene-text image captioning, addressing language-specific challenges with a new dataset, graph models, and phonological reasoning.

Contribution

It proposes HSTFG and PhonoSTFG, novel graph-based fusion models incorporating Vietnamese linguistic features, and introduces ViTextCaps, the first large-scale Vietnamese scene-text captioning dataset.

Findings

01. Cross-modal graph edges are harmful for scene-text fusion.

02. Phonological fusion improves Vietnamese captioning accuracy.

03. ViTextCaps enables comprehensive evaluation of Vietnamese scene-text captioning.
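
To make the first finding concrete, the sketch below shows what "cross-modal graph edges" means when a fusion graph is written as an attention mask. It is an illustration only: this summary does not describe the actual HSTFG topology, so the node types, the function name `edge_mask`, and the `allow_cross_modal` flag are all hypothetical.

```python
# Illustrative only: node types and function names are assumptions,
# not taken from the paper.

def edge_mask(node_types, allow_cross_modal):
    """Boolean adjacency: mask[i][j] is True if node i may attend to node j."""
    n = len(node_types)
    return [
        [allow_cross_modal or node_types[i] == node_types[j] for j in range(n)]
        for i in range(n)
    ]

# Two visual-region nodes and three OCR-token nodes (made-up example).
nodes = ["visual", "visual", "ocr", "ocr", "ocr"]

full = edge_mask(nodes, allow_cross_modal=True)    # every pair connected
intra = edge_mask(nodes, allow_cross_modal=False)  # same-modality pairs only

# Number of directed edges that link a visual node with an OCR node.
cross_edges = sum(row.count(True) for row in full) - sum(
    row.count(True) for row in intra
)
```

With the cross-modal edges removed, each modality exchanges information only among its own nodes inside the graph, so any mixing of visual and OCR features has to happen elsewhere in the model.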

Abstract

Scene-text image captioning requires fusing three information streams (visual features, OCR-detected text, and linguistic knowledge) to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese, a tonal language in which diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands linguistically informed multimodal fusion, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated by these insights, we propose HSTFG (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion.…
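
The abstract's point that diacritics alter word meaning can be illustrated with Unicode decomposition. The sketch below separates the tone mark from a Vietnamese syllable while keeping vowel-quality marks (as in ă, â, ê, ô, ơ, ư). It is not the paper's phonological model; the helper `split_tone` and the tone labels are hypothetical names chosen for this example.

```python
import unicodedata

# Combining marks for the five marked Vietnamese tones; the level tone
# ("ngang") is unmarked. Labels are illustrative, unaccented tone names.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}

def split_tone(syllable):
    """Return (syllable without its tone mark, tone label)."""
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "ngang"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # base letters and vowel-quality marks stay
    return unicodedata.normalize("NFC", "".join(kept)), tone

# Six distinct words that differ only in tone
# (ghost, mother, but, grave, code, rice seedling).
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
bases = {split_tone(w)[0] for w in words}
tones = {split_tone(w)[1] for w in words}
```

All six words share the same toneless form, so an OCR system that drops or misreads a single diacritic silently swaps one word for another. This is one way to see why a language-agnostic treatment of OCR tokens loses information for Vietnamese.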
