Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models
Caner Erden

TL;DR
MAHA introduces a hierarchical attention framework for large language models that reduces computational complexity by dynamically partitioning input sequences and optimally aggregating multi-scale information through game theory and convex optimization.
Contribution
It presents a novel hierarchical attention mechanism with a mathematically rigorous aggregation strategy, improving scalability and global dependency modeling in LLMs.
Findings
Achieves an 81% reduction in FLOPs at sequence length 4096.
Demonstrates superior scalability over standard attention mechanisms.
Enables end-to-end training with differentiable optimization layers.
Abstract
The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention…
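The hierarchical decomposition described in the abstract can be illustrated with a rough NumPy sketch. It is not the paper's implementation. Fixed average pooling stands in for the learnable downsampling operators. A plain weighted sum stands in for the aggregation step, whose description is cut off above. The function names, the pooling factors (2, 4, 8), and the choice to omit the full-resolution scale are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool(x, factor):
    # Assumption: fixed average pooling replaces the learnable downsampling.
    # x has shape (n, d); n must be divisible by factor.
    n, d = x.shape
    return x.reshape(n // factor, factor, d).mean(axis=1)

def self_attention(x):
    # Single-head attention with identity projections (illustrative only).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

def multiscale_attention(x, factors=(2, 4, 8), weights=None):
    # Attend at each coarse scale, upsample back to length n, then mix.
    outputs = []
    for f in factors:
        coarse = self_attention(avg_pool(x, f))       # shape (n / f, d)
        outputs.append(np.repeat(coarse, f, axis=0))  # shape (n, d)
    if weights is None:
        # Assumption: uniform weights; the paper derives its weights differently.
        weights = np.full(len(factors), 1.0 / len(factors))
    return sum(w * o for w, o in zip(weights, outputs))

def score_cost_ratio(factors):
    # Score-matrix size relative to full attention: sum of (n / f)^2 over n^2.
    return sum(1.0 / f**2 for f in factors)
```

Under these assumptions, each scale with pooling factor f builds an (n/f) × (n/f) score matrix instead of an n × n one, which is where the savings over full attention come from. The ratio returned by `score_cost_ratio` counts only score-matrix entries and is not meant to reproduce the FLOP figure reported for the paper.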
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
