ARGENT: Adaptive Hierarchical Image-Text Representations

Chuong Huynh; Hossein Souri; Abhinav Kumar; Vitali Petsiuk; Deen Dayal Mohan; Suren Kumar

arXiv:2603.23311 · cs.CV · March 25, 2026

TL;DR

This paper introduces ARGENT, a hyperbolic vision-language model that captures hierarchical structure in image-text representations, improves on prior hyperbolic VLMs, and proposes new evaluation metrics for hierarchy understanding.

Contribution

The paper presents an adaptive hyperbolic embedding method with a novel entailment loss and a new hierarchical evaluation protocol, advancing hyperbolic VLM capabilities.

Findings

01

ARGENT outperforms previous hyperbolic VLMs on multiple benchmarks.

02

The adaptive loss prevents cone collapse and maintains hierarchy integrity.

03

New hierarchical metrics provide more reliable evaluation of hierarchical understanding.

Abstract

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm…
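
The cone-widening effect described in the abstract can be illustrated numerically. The sketch below is not ARGENT's loss. It assumes the cone half-aperture form used in earlier Lorentz-model entailment-cone work, aperture = arcsin(2K / (sqrt(c) * ||x_space||)), with a hypothetical boundary constant K = 0.1 and curvature c = 1.0. The function name `half_aperture` and its parameters are illustrative only.

```python
import math


def half_aperture(space_norm: float, curvature: float = 1.0, k: float = 0.1) -> float:
    """Half-aperture (radians) of an entailment cone rooted at an embedding.

    Assumed form (from prior entailment-cone work, not from this paper):
        arcsin(2K / (sqrt(c) * ||x_space||))
    `space_norm` is the norm of the spatial part of the embedding.
    """
    ratio = 2.0 * k / (math.sqrt(curvature) * space_norm)
    # Clamp so arcsin stays defined once the embedding is very close to the origin.
    return math.asin(min(ratio, 1.0))


if __name__ == "__main__":
    # As the parent's norm shrinks, the cone opens up toward 90 degrees (a half-space).
    for norm in (2.0, 1.0, 0.5, 0.25, 0.2):
        degrees = math.degrees(half_aperture(norm))
        print(f"norm={norm:4.2f}  half-aperture={degrees:5.1f} deg")
```

Under these assumptions, the half-aperture grows monotonically as the norm decreases and saturates at 90 degrees, at which point the cone no longer discriminates between children that belong to the parent and those that do not.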

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Face recognition and analysis