HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng, Zhengqin Xu, Qingyang Liu, Xiaokang Yang, Wei Shen

TL;DR
HyperET introduces an efficient hyperbolic space-based training method for multi-modal large language models, enabling hierarchical alignment at various granularities with minimal additional parameters.
Contribution
The paper proposes HyperET, a novel hyperbolic space-based training paradigm that improves multi-modal alignment efficiency in large language models with minimal parameter overhead.
Findings
HyperET consistently enhances performance across multiple benchmarks.
It achieves these improvements with less than 1% additional parameters.
HyperET reduces training resource requirements for multi-modal models.
Abstract
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multiple granularity levels. To address this issue, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual…
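To make the abstract's claim that hyperbolic space "inherently models hierarchical levels" more concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball, a standard model of hyperbolic space. It is a generic illustration and not HyperET's training procedure, which the truncated abstract does not specify. The function name `poincare_distance` and the curvature parameter `c` are assumptions introduced for this example.

```python
import math


def poincare_distance(u, v, c=1.0):
    """Geodesic distance between points u and v in the Poincare ball
    of curvature -c (illustrative helper, not taken from the paper)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Both points must lie strictly inside the ball of radius 1/sqrt(c).
    if c * sq_u >= 1.0 or c * sq_v >= 1.0:
        raise ValueError("points must lie inside the Poincare ball")
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_u) * (1.0 - c * sq_v))
    return math.acosh(arg) / math.sqrt(c)


# The same Euclidean gap (0.1) measured near the origin and near the boundary.
d_center = poincare_distance((0.0, 0.0), (0.1, 0.0))
d_edge = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(d_center, d_edge)
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary than near the origin. This is why points near the origin are often read as coarse, general concepts and points near the boundary as fine-grained ones, which is one common way to interpret a "granularity level" in such a space.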
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
