HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

Ruijia Wu; Ping Chen; Fei Shen; Shaoan Zhao; Qiang Hui; Huanlin Gao; Ting Lu; Zhaoxiang Liu; Fang Zhao; Kai Wang; Shiguo Lian

arXiv:2511.06653·cs.CV·November 11, 2025

TL;DR

HiMo-CLIP enhances vision-language models by incorporating semantic hierarchy and monotonicity, enabling better handling of complex, compositional, and long-form descriptions without changing the core encoder architecture.

Contribution

It introduces a hierarchical decomposition module and a monotonicity-aware contrastive loss to improve semantic understanding in CLIP-style models.

Findings

01

Outperforms baselines on image-text retrieval benchmarks.

02

Excels with long and compositional descriptions.

03

Produces structured, cognitively aligned cross-modal representations.

Abstract

Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via…
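
The monotonicity property described in the abstract can be illustrated with a small check. This is only a sketch of the property itself, not the paper's loss or the HiDe module, neither of which is specified in the text above. The function names, the toy three-dimensional embeddings, and the tolerance value are all assumptions made for illustration; real CLIP embeddings would come from an image and text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_monotonic(scores, tol=1e-9):
    """True if each score is at least as large as the one before it.

    `tol` is a hypothetical tolerance for floating-point noise.
    """
    return all(later >= earlier - tol
               for earlier, later in zip(scores, scores[1:]))


def alignment_scores(image_emb, caption_embs):
    """Similarity of one image to captions ordered from sparse to rich."""
    return [cosine_similarity(image_emb, c) for c in caption_embs]


# Toy embeddings (assumed, not produced by any real model). The captions
# are ordered from least to most descriptive of the image.
image_emb = [1.0, 0.0, 0.0]
caption_embs = [
    [1.0, 1.0, 1.0],  # short, generic caption
    [1.0, 1.0, 0.0],  # caption with more relevant detail
    [1.0, 0.0, 0.0],  # richest caption
]

scores = alignment_scores(image_emb, caption_embs)
print(scores, is_monotonic(scores))
```

A model that satisfies semantic monotonicity in this sense would pass the check for every image and every sequence of progressively richer captions; a violation at any step indicates that added detail lowered the alignment score.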

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning