Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

Sarang Patil; Ashish Parmanand Pandey; Ioannis Koutis; Mengjia Xu

arXiv:2505.18973·cs.CL·December 8, 2025

Open Access · 1 Repo · 4 Reviews

TL;DR

This paper introduces Hierarchical Mamba (HiM), a novel framework combining hyperbolic geometry with sequence models to improve hierarchical language understanding and reasoning in NLP tasks.

Contribution

The paper proposes HiM, integrating Mamba2 with hyperbolic geometry, enabling hierarchy-aware embeddings that outperform Euclidean models on multiple hierarchical NLP tasks.

Findings

01 HiM effectively captures hierarchical relationships in language data.

02 HiM variants outperform Euclidean baselines on four datasets.

03 HiM-Poincaré provides detailed hierarchical distinctions, while HiM-Lorentz offers robustness.

Abstract

Selective state-space models excel at long-sequence modeling, but their capacity for language representation, particularly in complex hierarchical reasoning, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball or Lorentzian manifold with "learnable" curvature, optimized with a hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. Experimental results show…
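As a rough illustration of the projection step the abstract describes, the sketch below maps a Euclidean vector onto a Poincaré ball and measures hyperbolic distance there. It uses the standard formulas for a ball of curvature -c (exponential map at the origin and the arcosh distance); it is not taken from the paper or its repository. The function names `exp_map_origin` and `poincare_distance`, the fixed value of `c`, and the 2-D toy vectors are all assumptions made for illustration. In a trained model, `c` would be a learnable parameter rather than a constant.

```python
import math


def exp_map_origin(v, c):
    """Exponential map at the origin: send a Euclidean (tangent) vector v
    onto the Poincare ball of curvature -c (c > 0).
    Hypothetical helper, not the paper's code."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c):
    """Geodesic distance between two points inside the ball of curvature -c."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * sq_x) * (1.0 - c * sq_y))
    # Clamp against rounding so acosh never sees a value below 1.
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)


# Toy usage: assumed stand-ins for a pooled "parent" and "child" embedding.
c = 0.7  # assumed curvature magnitude; learnable in a real model
parent = exp_map_origin([0.1, 0.0], c)
child = exp_map_origin([1.5, 0.2], c)
print(poincare_distance(parent, child, c))
```

Under this convention a point's distance from the origin grows with the norm of the vector it was mapped from, which is why hyperbolic embeddings are often read as placing general concepts near the centre and specific ones near the boundary.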

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

New exploitation, Mamba2+Hyperbolic

Weaknesses

The paper assumes that the fusion of Mamba2 and hyperbolic space can significantly enhance long-sequence and hierarchical reasoning capabilities, but the experiments mainly focus on four preset Ontology-class datasets, failing to cover large-scale real-world scenarios in natural language processing (such as open-domain question answering or multi-document reasoning), resulting in insufficient model generalizability. Although the authors compared their approach with Euclidean Mamba and Hyperboli

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Significant performance gains. The method delivers consistently strong improvements across tasks.
- Systematic manifold/curvature study. It compares Poincaré and Lorentz models across multiple datasets and tasks, analyzing how curvature and manifold choice affect performance and stability.

Weaknesses

1. The paper compares only against Mamba, lacking head-to-head evaluations with Hyperbolic Transformers and Euclidean Transformers under matched settings to quantify both accuracy and runtime gains.
2. There is no theoretical or empirical accounting of the constant-factor cost of hyperbolic operations versus Euclidean ones (e.g., training/inference latency, memory), despite relying on projections/distances (cosh/sinh, exp/log maps).
3. Incomplete approximation analysis. The Maclaurin approximat

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The core idea of combining the $O(L)$ efficiency of Mamba2 with the $O(L)$ representational power of hyperbolic geometry for hierarchical data is novel and well-motivated. The introduction of the SentenceMamba-16M model provides a lightweight and efficient backbone for sentence embedding tasks. The inclusion of a zero-shot comparison against GPT-4o on the WordNet task is a strong addition, demonstrating that a small, specialized model can outperform a massive, general-purpose one on a specific

Weaknesses

The paper's claim that Mamba2's selective properties are key to its success is not fully proven. Since the methodology applies mean pooling to the Mamba2 block outputs to get a single vector, it is unclear if Mamba's selectivity is providing a benefit beyond just being an efficient $O(L)$ encoder (maybe LSTM?).

A critical detail seems vague. The paper mentions a "geometric stabilization technique that periodically projects the model parameters back onto the manifold" every 100 steps. It does not

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1) First work integrating Mamba2 and hyperbolic embeddings.
2) The paper includes both Poincaré and Lorentz formulations, curvature regularization, and stability approximations.
3) Empirical study across four ontology datasets, including δ-hyperbolicity analysis and visualization of embedding hierarchies.

Weaknesses

1) Questionable motivation for long-range dependency modeling. Ontology reasoning is structurally hierarchical but not sequential. It's unclear why Mamba's sequence modeling is advantageous here. The link between "long-range dependencies" and "multi-hop ontology reasoning" is weak and mismatched.
2) Limited novelty. Integrating a known efficient sequence model (Mamba2) with hyperbolic embeddings is incremental, given prior works like SHMamba (2024) and Hyperbolic BERT / HiT (He et al 2024) / H

Code & Models

Repositories

berrybyte/him
PyTorch · Official

Taxonomy

Topics: Evolutionary Algorithms and Applications · Human Motion and Animation

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces