Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction

Baohang Zhou; Kehui Song; Rize Jin; Yu Zhao; Xuhui Sui; Xinying Qian; Xingyue Guo; Ying Zhang

arXiv:2603.16259 · cs.MM · March 18, 2026

TL;DR

This paper introduces HMGRL (Hyperbolic Multimodal Generative Representation Learning), a framework that learns multimodal representations in hyperbolic space for generalized zero-shot multimodal information extraction. It models hierarchical semantic relationships and improves recognition of both seen and unseen categories.

Contribution

The paper proposes a hyperbolic multimodal generative framework with semantic similarity distribution alignment for generalized zero-shot multimodal information extraction.

Findings

01. HMGRL outperforms baseline methods on benchmark datasets.

02. Hyperbolic space captures hierarchical semantic relationships effectively.

03. Semantic similarity distribution alignment improves generalization.

Abstract

Multimodal information extraction (MIE) comprises a set of essential tasks that extract structured information from Web text paired with images, supporting the construction of Web-based semantic knowledge. To handle an expanding category set that includes newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task, which aims to extract unseen structural knowledge from textual and visual modalities. However, ZS-MIE models can only recognize samples that fall within the unseen category set, and they struggle in real-world scenarios that contain both seen and unseen categories. The shortcomings of existing methods stem from two main aspects. On the one hand, these methods construct representations of samples and categories in Euclidean space, failing to capture the hierarchical…
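To illustrate why the choice of geometry matters for hierarchical structure, here is a minimal sketch of the distance function on the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model or curvature HMGRL uses, so the unit ball with curvature -1 is an assumption, and the function names `poincare_distance` and `euclidean_distance` are hypothetical. The point is only that the same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary of the ball, which leaves room for many fine-grained categories to sit far apart under a shared coarse parent.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumes curvature -1; this is a generic textbook formula, not
    necessarily the formulation used in the paper.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Two pairs with the same Euclidean gap of 0.05: one pair near the
    # origin, one pair near the boundary of the ball.
    near_origin = ([0.0, 0.0], [0.05, 0.0])
    near_boundary = ([0.9, 0.0], [0.95, 0.0])
    for name, (p, q) in (("origin", near_origin), ("boundary", near_boundary)):
        print(name, euclidean_distance(p, q), poincare_distance(p, q))
```

Running the script prints both distances for each pair; the Euclidean values match while the hyperbolic value for the boundary pair is several times larger.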

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Text and Document Classification Technologies