DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU
Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

TL;DR
DASH is a differentiable architecture search method for hybrid attention in large language models. Each search takes about 20 minutes on a single GPU and uses only 12.3M tokens.
Contribution
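The paper introduces DASH, a fast differentiable search framework for hybrid attention architecture design. It relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and searches over the architecture only, which reduces both search time and data requirements.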
Findings
DASH outperforms existing hybrid attention baselines on Qwen2.5-3B-Instruct.
DASH achieves better RULER performance than Jet-Nemotron models.
Each DASH search takes about 20 minutes on a single GPU with only 12.3M tokens.
Abstract
Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with…
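The relaxation described in the abstract can be illustrated with a minimal sketch. This is a generic DARTS-style softmax relaxation, not the authors' implementation: the names `OPS`, `mixed_output`, and `discretize`, the two-operator candidate set, and all numeric values are hypothetical and chosen only for illustration. Each layer holds one logit per candidate operator; during search the layer output is a softmax-weighted mix of the candidate outputs, and afterwards each layer keeps its highest-logit operator.

```python
import math

# Hypothetical candidate set: one full-attention and one linear-attention operator per layer.
OPS = ("full_attention", "linear_attention")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(logits, candidate_outputs):
    """Continuous relaxation for one layer: softmax-weighted sum of candidate outputs."""
    weights = softmax(logits)
    dim = len(candidate_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidate_outputs))
        for i in range(dim)
    ]

def discretize(arch_logits, op_names):
    """After search, keep the highest-logit operator in each layer (ties pick the first)."""
    return [
        op_names[max(range(len(row)), key=row.__getitem__)]
        for row in arch_logits
    ]

# Illustrative architecture logits for a 3-layer model (made-up values).
arch_logits = [[0.0, 0.0], [1.5, -0.5], [-2.0, 1.0]]

# Layer 0 has equal logits, so its output is the plain average of the two candidates.
layer_out = mixed_output(arch_logits[0], [[1.0, 2.0], [3.0, 4.0]])

# Discrete hybrid layout read off from the logits.
chosen = discretize(arch_logits, OPS)
```

In an actual search the logits would be trained by gradient descent while the model weights stay fixed, which is one plausible reading of "architecture-only search"; that training loop is omitted here.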
