Hyperbolic Learning with Multimodal Large Language Models
Paolo Mandica, Luca Franco, Konstantinos Kallidromitis, Suzanne Petryk, Fabio Galasso

TL;DR
This paper introduces a scalable hyperbolic learning approach for multimodal large language models. It reports stable training at the billion-parameter scale and embeddings that give a meaningful indication of uncertainty in vision-language tasks.
Contribution
It proposes a novel training strategy for hyperbolic multimodal models, enabling scaling to billions of parameters with stable training and uncertainty insights.
Findings
Achieved comparable performance to Euclidean models
Maintained training stability at large scale
Provided meaningful uncertainty indications
Abstract
Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multimodal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic…
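To make the idea of a hyperbolic embedding concrete, here is a minimal sketch, not the paper's implementation. It assumes the Poincaré ball model with curvature -c; the abstract does not say which model of hyperbolic space is used. The function names (expmap0, hyperbolic_radius, poincare_distance) are hypothetical. The sketch projects a Euclidean feature vector onto the ball and measures its geodesic distance from the origin. Treating that radius as a proxy for certainty is an assumption made for illustration here and is not confirmed by the text above.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Maps a Euclidean (tangent) vector v to a point strictly inside the ball.
    """
    n = norm(v)
    if n == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * n) / (sqrt_c * n)
    return [scale * x for x in v]


def hyperbolic_radius(x, c=1.0):
    """Geodesic distance of x from the origin: (2/sqrt(c)) * artanh(sqrt(c)*||x||)."""
    sqrt_c = math.sqrt(c)
    # Clamp so that points numerically on the boundary do not produce infinity.
    arg = min(sqrt_c * norm(x), 1.0 - 1e-12)
    return 2.0 / sqrt_c * math.atanh(arg)


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * norm(x) ** 2) * (1.0 - c * norm(y) ** 2)
    return math.acosh(1.0 + 2.0 * c * diff2 / denom) / math.sqrt(c)


# Illustrative input only: a 2-D "feature" with Euclidean norm 0.5.
point = expmap0([0.3, 0.4])
radius = hyperbolic_radius(point)
```

Under these assumptions the projected point always lies inside the unit ball, and its hyperbolic radius grows without bound as the Euclidean norm of the input feature grows. That unbounded radius is the quantity an uncertainty analysis of this kind would inspect.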
Taxonomy
Methods: Contrastive Language-Image Pre-training
