Hyperbolic Learning with Multimodal Large Language Models

Paolo Mandica; Luca Franco; Konstantinos Kallidromitis; Suzanne Petryk; Fabio Galasso

arXiv:2408.05097·cs.LG·August 12, 2024

TL;DR

This paper introduces a scalable hyperbolic learning approach for multimodal large language models, demonstrating stable training and meaningful uncertainty estimation, advancing the integration of hyperbolic embeddings in vision-language tasks.

Contribution

It proposes a novel training strategy for hyperbolic multimodal models, enabling scaling to billions of parameters with stable training and uncertainty insights.

Findings

01. Achieved comparable performance to Euclidean models

02. Maintained training stability at large scale

03. Provided meaningful uncertainty indications

Abstract

Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic…
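To make the idea of a hyperbolic embedding concrete, here is a minimal sketch that maps a Euclidean feature vector onto the Poincaré ball and reads off its distance from the origin. The abstract does not say which model of hyperbolic space the paper uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `hyperbolic_radius` are all assumptions made for illustration. Treating the radius as an uncertainty proxy follows common practice in the hyperbolic-learning literature and is not taken from this page.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Assumed formulation: exp_0(v) = tanh(sqrt(c)*|v|) * v / (sqrt(c)*|v|).
    """
    v = np.asarray(v, dtype=np.float64)
    sqrt_c = np.sqrt(c)
    # Guard against division by zero for the zero vector.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def hyperbolic_radius(x, c=1.0, eps=1e-7):
    """Geodesic distance from the origin: (2/sqrt(c)) * artanh(sqrt(c)*|x|)."""
    x = np.asarray(x, dtype=np.float64)
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(x, axis=-1)
    # Keep points strictly inside the ball so arctanh stays finite.
    norm = np.clip(norm, 0.0, (1.0 - eps) / sqrt_c)
    return 2.0 / sqrt_c * np.arctanh(sqrt_c * norm)


# Hypothetical Euclidean feature, e.g. the output of an encoder head.
feature = np.array([0.3, 0.4])
point = expmap0(feature)
radius = hyperbolic_radius(point)
```

With this formulation the hyperbolic radius equals twice the Euclidean norm of the input vector, so embeddings with larger norms land closer to the boundary of the ball. Under the assumption above, a small radius would be read as higher uncertainty.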

Taxonomy

Methods: Contrastive Language-Image Pre-training