ARGENT: Adaptive Hierarchical Image-Text Representations
Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar

TL;DR
This paper introduces ARGENT, a hyperbolic vision-language model that captures hierarchical structure in image-text representations, outperforms previous hyperbolic VLMs, and proposes new evaluation metrics for hierarchy understanding.
Contribution
The paper presents an adaptive hyperbolic embedding method with a novel entailment loss, together with a new protocol for evaluating hierarchical understanding in hyperbolic VLMs.
Findings
ARGENT outperforms previous hyperbolic VLMs on multiple benchmarks.
The adaptive loss prevents cone collapse and maintains hierarchy integrity.
New hierarchical metrics provide more reliable evaluation of hierarchical understanding.
Abstract
Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm…
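The cone-widening effect described in the abstract can be illustrated numerically. The sketch below is not ARGENT's adaptive loss; it assumes the half-aperture formula commonly used by earlier hyperbolic entailment-cone models, aper(x) = arcsin(2K / (sqrt(c) * ||x_space||)), where the constant `k`, the curvature value, and the function name `half_aperture` are illustrative choices rather than values taken from the paper. Under that assumption, the half-aperture grows as the parent embedding's norm shrinks and reaches 90 degrees, a half-space, once the norm falls to 2K / sqrt(c).

```python
import math


def half_aperture(space_norm: float, curvature: float = 1.0, k: float = 0.1) -> float:
    """Half-aperture (radians) of an entailment cone rooted at an embedding.

    Assumed formula: asin(2k / (sqrt(curvature) * space_norm)).
    `k` and `curvature` are hypothetical defaults chosen for illustration.
    """
    ratio = 2.0 * k / (math.sqrt(curvature) * space_norm)
    # Below the norm 2k / sqrt(c) the formula leaves asin's domain;
    # clamping models the cone saturating at a half-space (90 degrees).
    ratio = min(ratio, 1.0)
    return math.asin(ratio)


# Parent embeddings moving progressively closer to the origin.
norms = [5.0, 1.0, 0.5, 0.25, 0.2]
apertures = [half_aperture(n) for n in norms]

for n, a in zip(norms, apertures):
    print(f"norm={n:4.2f}  half-aperture={math.degrees(a):6.2f} deg")
```

Once the cone is a half-space, nearly any child embedding falls inside it, so an entailment penalty of this form would no longer distinguish true descendants from unrelated concepts. This is one way to read the "cone collapse" that the abstract attributes to existing losses.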
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Face recognition and analysis
