Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models

Caner Erden

arXiv:2512.14925 · cs.CL · December 19, 2025

TL;DR

MAHA introduces a hierarchical attention framework for large language models that reduces computational complexity by dynamically partitioning input sequences and optimally aggregating multi-scale information through game theory and convex optimization.
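To illustrate why attending over downsampled scales can lower cost, the following is a minimal back-of-the-envelope sketch, not the paper's own accounting. It assumes a per-head attention cost of roughly 4·n²·d FLOPs (the QKᵀ product plus the weighted sum over V) and a hypothetical set of downsampling ratios (2, 4, 8); the function names are illustrative, and the cost of downsampling, softmax, and aggregation is ignored. The figure it prints is therefore not expected to match the reduction reported under Findings.

```python
def attention_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for one attention head over seq_len tokens.

    Assumption: QK^T and the attention-weighted sum over V each cost
    about 2 * n * n * d FLOPs, so the total is 4 * n^2 * d.
    """
    return 4 * seq_len * seq_len * head_dim


def multiscale_flops(seq_len: int, head_dim: int, ratios: tuple) -> int:
    """Sum the attention cost over several downsampled copies of the sequence.

    `ratios` are hypothetical downsampling factors; each scale attends
    over seq_len // ratio tokens. Downsampling and aggregation costs
    are deliberately left out of this estimate.
    """
    return sum(attention_flops(seq_len // r, head_dim) for r in ratios)


def flops_reduction(seq_len: int, head_dim: int, ratios: tuple) -> float:
    """Fraction of full-resolution attention FLOPs saved by the multiscale variant."""
    full = attention_flops(seq_len, head_dim)
    return 1.0 - multiscale_flops(seq_len, head_dim, ratios) / full


if __name__ == "__main__":
    saved = flops_reduction(seq_len=4096, head_dim=64, ratios=(2, 4, 8))
    print(f"Estimated attention FLOPs saved under these assumptions: {saved:.1%}")
```

Because the cost is quadratic in sequence length, halving the length of a scale cuts its attention cost to a quarter, which is why even several coarse scales together can cost less than one full-resolution pass.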

Contribution

It presents a novel hierarchical attention mechanism with a mathematically rigorous aggregation strategy, improving scalability and global dependency modeling in LLMs.

Findings

01. Achieves an 81% reduction in FLOPs at sequence length 4096.

02. Demonstrates superior scalability over standard attention mechanisms.

03. Enables end-to-end training with differentiable optimization layers.

Abstract

The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention…

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications