TL;DR
HAC is a parameter-efficient method that adapts pretrained CLIP models to hyperbolic space for zero-shot VQA, improving performance across diverse benchmarks without training from scratch.
Contribution
The paper proposes HAC, a lightweight fine-tuning framework that enables hyperbolic adaptation of CLIP for zero-shot VQA, outperforming previous hyperbolic methods and Euclidean baselines.
Findings
HAC outperforms Euclidean baselines on VQA benchmarks.
HAC-B improves reasoning tasks by up to 1.9 points over CLIP-B.
HAC is trained on a dataset with no overlap with any VQA benchmark, so its evaluation is strictly zero-shot.
Abstract
Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic…
