Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

TL;DR
This paper introduces a novel hierarchical feature alignment method for vision-language models that constructs tree-like features for both images and text, embedding them into hyperbolic spaces with different curvatures to improve cross-modal alignment.
Contribution
It proposes a new approach to align hierarchical features across heterogeneous hyperbolic manifolds, including a semantic-aware visual feature extraction framework and a KL-based manifold alignment technique.
Findings
Outperforms strong baselines in open-set classification tasks
Effective in few-shot and cross-domain scenarios
Demonstrates the benefit of hyperbolic manifold embedding for hierarchical feature alignment
Abstract
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous…
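As a rough illustration of what embedding features into hyperbolic manifolds with distinct curvatures can look like, the sketch below maps the same toy feature vectors onto two Poincaré balls of different curvature. The abstract does not say which hyperbolic model or which curvature values the paper uses, so the Poincaré ball, the function names `expmap0` and `poincare_dist`, and the values `c_img` and `c_txt` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin of a Poincare ball with curvature -c (c > 0)."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c):
    """Geodesic distance between points x and y on the ball with curvature -c."""
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x ** 2, axis=-1)) * (1 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1 + 2 * c * sq_diff / denom) / np.sqrt(c)

# Toy Euclidean features standing in for three tree levels (made-up data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4)) * 0.1

# Hypothetical curvatures: one ball per modality.
c_img, c_txt = 0.5, 2.0
img_pts = expmap0(feats, c_img)
txt_pts = expmap0(feats, c_txt)

# Each ball has radius 1/sqrt(c), so the same input lands in differently sized balls.
print(np.linalg.norm(img_pts, axis=-1), 1 / np.sqrt(c_img))
print(np.linalg.norm(txt_pts, axis=-1), 1 / np.sqrt(c_txt))

# Distance from the origin is 2 * ||v|| in this convention (conformal factor 2 at 0).
print(poincare_dist(np.zeros_like(img_pts), img_pts, c_img))
```

Because the two balls have different curvatures, points on them are not directly comparable with a single distance function, which is the kind of mismatch a cross-manifold alignment step has to handle.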
Peer Reviews
Decision: ICLR 2026 Poster
- Interesting idea worth exploring
- Some promising experimental results
- Generalization to novel classes investigated
- Most of the examples focus on hierarchical relationships from biological taxonomy. While there are experiments on more diverse datasets such as ImageNet, the paper does not provide much analysis and insight into how the model behaves on other kinds of hierarchies, especially when the data is more diverse.
- On the Rare Species Dataset, the authors do not stick with the original labels but instead use hierarchical label names created by concatenating the various hierarchy levels in Algorithm 1
S1. The paper identifies and tackles an issue in VLMs: the mismatch in representational hierarchy between vision and language. Figure 1 clearly illustrates this contrast, highlighting how the proposed tree-based structure achieves more symmetric alignment.
S2. The paper includes comprehensive ablations and visualizations.
W1. The Taylor approximation used to compute the KL-based manifold distance (Appendix A) lacks empirical evaluation. No sensitivity analysis is provided for the approximation constant 𝑟, raising concerns about robustness and stability.
W2. The paper does not analyze potential failure modes or the computational overhead of learning multiple curvatures.
W3. The approach to class imbalance across different tree levels is not explained, leaving open questions about potential bias in curvature learning.
- The paper addresses an important and interesting problem of achieving efficient and symmetric modality alignment for hierarchical semantic structures.
- The work presents a novel approach by leveraging hyperbolic spaces with learnable curvatures and an intermediate manifold for modality alignment between images and text.
- Comprehensive experiments demonstrate the method's effectiveness, showing consistent improvements across multiple datasets and settings for taxonomic classification.
- The design of the semantic-aware feature extraction module appears somewhat heuristic and lacks strong theoretical justification. The critical choices of which specific intermediate transformer layers to use and the decision to disable cross-token attention are based on empirical observations rather than derived from first principles. This raises concerns about the generalizability and robustness of this core component, as its effectiveness might be sensitive to the specific architecture (e.g.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
