OTCR: Optimal Transmission, Compression and Representation for Multimodal Information Extraction
Yang Li, Yajiao Wang, Wenhao Hu, Zhixiong Zhang, Mengting Zhang

TL;DR
OTCR introduces a task-centric, interpretable framework for multimodal information extraction that selectively fuses text and visual cues, improving efficiency and reducing redundancy in document AI applications.
Contribution
The paper proposes a two-stage approach that combines cross-modal optimal transport with a variational information bottleneck, so that multimodal fusion becomes controllable and task-specific.
Findings
Achieves state-of-the-art results on FUNSD and XFUND datasets.
Reduces modality redundancy and enhances task signal strength.
Provides an interpretable, information-theoretic fusion paradigm.
Abstract
Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. While recent methods have advanced multimodal representation learning, most implicitly assume modality equivalence or treat modalities in a largely uniform manner, still relying on generic fusion paradigms. This often results in indiscriminate incorporation of multimodal signals and insufficient control over task-irrelevant redundancy, which may in turn limit generalization. We revisit MIE from a task-centric view: text should dominate, vision should selectively support. We present OTCR, a two-stage framework. First, Cross-modal Optimal Transport (OT) yields sparse, probabilistic alignments between text tokens and visual patches, with a context-aware gate controlling visual injection. Second, a Variational Information Bottleneck (VIB) compresses fused features, filtering…
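The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the cost function, the marginals, the gate form, or the bottleneck prior, so everything here is an assumption. The sketch uses a cosine cost, uniform marginals, entropic (Sinkhorn) optimal transport, a sigmoid gate over concatenated features, and a standard-normal prior for the VIB term. The names `sinkhorn`, `cosine_cost`, `gated_fusion`, `vib_kl`, and `w_gate` are hypothetical. Note that entropic OT produces a dense plan that only becomes approximately sparse for small `eps`; the paper's sparse alignment may be obtained differently.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed), via Sinkhorn."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on text tokens (assumption: uniform)
    b = np.full(m, 1.0 / m)      # mass on visual patches (assumption: uniform)
    K = np.exp(-cost / eps)      # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def cosine_cost(text, vis):
    """Pairwise cost 1 - cos(text_i, vis_j); the cost choice is an assumption."""
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    p = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    return 1.0 - t @ p.T

def gated_fusion(text, vis, w_gate, eps=0.1):
    """Stage 1 sketch: OT alignment, then a gate that limits visual injection."""
    plan = sinkhorn(cosine_cost(text, vis), eps)
    # Row-normalise so each token receives a convex combination of patches.
    attn = plan / plan.sum(axis=1, keepdims=True)
    vis_ctx = attn @ vis
    # Hypothetical gate: sigmoid of a linear map over [text; visual context].
    gate_in = np.concatenate([text, vis_ctx], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))   # shape (n_tokens, 1)
    # Text stays dominant; vision is added only as far as the gate allows.
    return text + gate * vis_ctx, plan, gate

def vib_kl(mu, logvar):
    """Stage 2 sketch: KL(N(mu, diag(exp(logvar))) || N(0, I)) per token."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))    # 5 text tokens, 8-dim features
vis = rng.normal(size=(7, 8))     # 7 visual patches, 8-dim features
w_gate = rng.normal(size=(16, 1))
fused, plan, gate = gated_fusion(text, vis, w_gate)
```

In a trained model, `mu` and `logvar` would be predicted from `fused` by an encoder, and the KL term from `vib_kl` would be added to the task loss with a weight that controls how strongly the fused features are compressed.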
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
