OTCR: Optimal Transmission, Compression and Representation for Multimodal Information Extraction

Yang Li; Yajiao Wang; Wenhao Hu; Zhixiong Zhang; Mengting Zhang

arXiv:2511.14766 · cs.IR · November 20, 2025

TL;DR

OTCR introduces a task-centric, interpretable framework for multimodal information extraction that selectively fuses text and visual cues, improving efficiency and reducing redundancy in document AI applications.

Contribution

It proposes a novel two-stage approach using optimal transport and information bottleneck to enhance multimodal fusion with controllable, task-specific representations.
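
The information bottleneck half of this contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the source does not specify the encoder or the prior, so the diagonal Gaussian posterior, the standard normal prior, and the names `vib_sample_and_kl`, `vib_loss`, and `beta` are all assumptions made here for illustration.

```python
import numpy as np

def vib_sample_and_kl(mu, log_var, rng):
    """Hypothetical helper: sample a compressed code z and compute its KL cost.

    Assumes a diagonal Gaussian posterior N(mu, exp(log_var)) and a
    standard normal prior N(0, I); neither is confirmed by the source.
    """
    noise = rng.standard_normal(mu.shape)
    # Reparameterisation: z = mu + sigma * noise
    z = mu + np.exp(0.5 * log_var) * noise
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over feature dims
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return z, kl

def vib_loss(task_loss, kl, beta=1e-3):
    """Task loss plus a beta-weighted compression penalty (beta is assumed)."""
    return float(task_loss + beta * np.mean(kl))
```

In this sketch a larger `beta` penalises information retained in `z` more heavily, which is the usual knob for trading task accuracy against compression in a variational bottleneck.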

Findings

01. Achieves state-of-the-art results on FUNSD and XFUND datasets.

02. Reduces modality redundancy and enhances task signal strength.

03. Provides an interpretable, information-theoretic fusion paradigm.

Abstract

Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. While recent methods have advanced multimodal representation learning, most implicitly assume modality equivalence or treat modalities in a largely uniform manner, still relying on generic fusion paradigms. This often results in indiscriminate incorporation of multimodal signals and insufficient control over task-irrelevant redundancy, which may in turn limit generalization. We revisit MIE from a task-centric view: text should dominate, vision should selectively support. We present OTCR, a two-stage framework. First, Cross-modal Optimal Transport (OT) yields sparse, probabilistic alignments between text tokens and visual patches, with a context-aware gate controlling visual injection. Second, a Variational Information Bottleneck (VIB) compresses fused features, filtering…
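
The first stage described in the abstract (aligning text tokens to visual patches with optimal transport, then gating how much visual signal is injected) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the abstract does not name a solver, so entropic OT with Sinkhorn iterations is swapped in here, and it produces a dense plan rather than the sparse one the abstract mentions. The cosine cost, uniform marginals, sigmoid gate, and the names `sinkhorn_plan`, `gated_fusion`, and `w_gate` are likewise hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed solver)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over text tokens
    b = np.full(m, 1.0 / m)   # uniform mass over visual patches
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def gated_fusion(text, vision, w_gate, eps=0.1):
    """Text-dominant fusion: text + gate * OT-aligned visual context.

    text:   (n_tokens, d) text features
    vision: (n_patches, d) visual features
    w_gate: (2 * d, 1) gate weights (hypothetical parameter)
    """
    # Cosine distance as transport cost (an assumption)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    p = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    cost = 1.0 - t @ p.T
    plan = sinkhorn_plan(cost, eps)
    # Row-normalise so each token receives a convex mix of patches
    weights = plan / plan.sum(axis=1, keepdims=True)
    visual_ctx = weights @ vision
    # Context-aware gate in (0, 1), one scalar per token
    logits = np.concatenate([text, visual_ctx], axis=1) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    fused = text + gate * visual_ctx
    return fused, plan, gate
```

A possible call, with random features standing in for real encoder outputs:

```python
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))     # 5 tokens, 8-dim features
vision = rng.standard_normal((7, 8))   # 7 patches, 8-dim features
w_gate = rng.standard_normal((16, 1))
fused, plan, gate = gated_fusion(text, vision, w_gate)
```

Keeping the text features as the residual path and scaling only the visual term reflects the abstract's view that text should dominate while vision selectively supports.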

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling