Hybrid Dual-Path Linear Transformations for Efficient Transformer Architectures
Vladimer Khasia

TL;DR
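This paper introduces the Hybrid Dual-Path Linear (HDPL) operator, which combines a sparse block-diagonal local path with a low-rank VAE global path inside Transformer projections. The resulting model outperforms a standard Llama baseline on FineWeb-Edu, reduces parameter count by 6.8% while improving validation loss, and exposes a probabilistic latent space for interpretability and control.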
Contribution
The paper proposes the Hybrid Dual-Path Linear (HDPL) operator that decomposes affine transformations into local and global pathways, enhancing efficiency and representational power in Transformer architectures.
Findings
Outperforms a standard Llama baseline on the FineWeb-Edu dataset
Reduces parameter count by 6.8% while improving validation loss
Provides a probabilistic latent space for interpretability and control
Abstract
Standard Transformer architectures rely heavily on dense linear transformations, treating feature projection as a monolithic, full-rank operation. We argue that this formulation is inefficient and lacks the structural inductive bias necessary for distinguishing between local feature preservation and global context integration. To address this, we introduce the Hybrid Dual-Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways: a sparse block-diagonal component for high-rank local processing, and a low-rank Variational Autoencoder (VAE) bottleneck for global context regularization. By "surgically" replacing specific projections (Query, Key, Value, Gate, Up) with HDPL operators while retaining standard dense layers for aggregation (Output, Down), we achieve a superior balance of efficiency and representational power. Experiments on…
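The two pathways described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `hdpl_forward`, the parameter names, the choice to sum the two paths, and the specific VAE parameterization (a mean and log-variance projection followed by the reparameterization trick) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def hdpl_forward(x, blocks, w_mu, w_logvar, w_dec, bias, rng=None):
    """Hypothetical HDPL-style layer: block-diagonal path plus low-rank VAE path.

    x:        (batch, d_in) input features
    blocks:   (n_blocks, b_in, b_out) dense blocks of the block-diagonal matrix
    w_mu:     (d_in, rank) projection to the latent mean
    w_logvar: (d_in, rank) projection to the latent log-variance
    w_dec:    (rank, d_out) projection from the latent back to feature space
    bias:     (d_out,) shared bias (assumed; makes the map affine)
    rng:      NumPy Generator; if None, the latent mean is used (no sampling)
    """
    batch, d_in = x.shape
    n_blocks, b_in, b_out = blocks.shape
    assert n_blocks * b_in == d_in, "blocks must tile the input dimension"

    # Local path: each input chunk is multiplied only by its own block,
    # which equals a matmul with a sparse block-diagonal matrix.
    x_chunks = x.reshape(batch, n_blocks, b_in)
    local = np.einsum("bni,nio->bno", x_chunks, blocks)
    local = local.reshape(batch, n_blocks * b_out)

    # Global path: all features are squeezed through a rank-limited latent.
    mu = x @ w_mu
    logvar = x @ w_logvar
    if rng is None:
        z = mu  # deterministic mode, e.g. for evaluation
    else:
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    global_out = z @ w_dec

    # Assumption: the two paths are combined by simple addition.
    return local + global_out + bias, mu, logvar


# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
d_in, d_out, n_blocks, rank = 8, 8, 4, 2
x = rng.standard_normal((3, d_in))
blocks = rng.standard_normal((n_blocks, d_in // n_blocks, d_out // n_blocks))
w_mu = rng.standard_normal((d_in, rank))
w_logvar = rng.standard_normal((d_in, rank))
w_dec = rng.standard_normal((rank, d_out))
bias = np.zeros(d_out)

y, mu, logvar = hdpl_forward(x, blocks, w_mu, w_logvar, w_dec, bias, rng=rng)
print(y.shape)  # (3, 8)
```

In deterministic mode this sketch reduces to a single affine map whose weight matrix is the block-diagonal matrix plus the rank-limited product `w_mu @ w_dec`, which is one way to read the abstract's "high-rank local" and "low-rank global" decomposition. The returned `mu` and `logvar` are what a KL regularization term would typically consume, though how the paper weights or applies that term is not stated in the text above.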
