DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

Weizhe Chen; Miao Zhang; Junpeng Jiang; Yaping Li; Weili Guan; Liqiang Nie

arXiv:2605.20936 · cs.LG · May 21, 2026

TL;DR

DASH is a differentiable architecture search method for hybrid attention in large language models. Each search takes about 20 minutes on a single GPU and uses only 12.3M tokens.

Contribution

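The paper introduces DASH, a fast differentiable search framework for hybrid attention architecture design. It cuts the search cost to about 20 minutes on a single GPU and 12.3M tokens per search, compared with the 200B tokens used by Jet-Nemotron's PostNAS search stages.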

Findings

1. DASH outperforms existing hybrid attention baselines on Qwen2.5-3B-Instruct.

2. DASH achieves better RULER performance than Jet-Nemotron models.

3. Each DASH search takes about 20 minutes on a single GPU with only 12.3M tokens.

Abstract

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with…
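
To make the idea of relaxing discrete operator placement into continuous architecture logits more concrete, here is a minimal sketch in plain Python. It is a generic DARTS-style illustration, not the paper's implementation: the two-candidate setup (full vs. linear attention), the function names, the logit values, and the `full_budget` discretization rule are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_layer_output(logits, candidate_outputs):
    """Softmax-weighted mix of the candidate operators' outputs for one layer.

    During search, each layer would emit this blend instead of committing
    to a single operator, so the logits can receive gradients.
    """
    weights = softmax(logits)
    dim = len(candidate_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidate_outputs))
        for i in range(dim)
    ]

def discretize(arch_logits, full_budget):
    """Turn per-layer logits into a discrete layout (hypothetical rule).

    Assumed column order: index 0 = full attention, index 1 = linear attention.
    The `full_budget` layers with the largest full-minus-linear margin keep
    full attention; every other layer uses the linear candidate.
    """
    margins = [(row[0] - row[1], idx) for idx, row in enumerate(arch_logits)]
    keep = {idx for _, idx in sorted(margins, reverse=True)[:full_budget]}
    return ["full" if i in keep else "linear" for i in range(len(arch_logits))]

# Illustrative logits for a 4-layer model (made-up values, not from the paper).
arch_logits = [[0.2, 1.5], [2.0, -0.5], [0.1, 0.0], [-1.0, 1.0]]
layout = discretize(arch_logits, full_budget=2)

# With equal logits the mix is a plain average of the two candidate outputs.
mixed = mixed_layer_output([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```

In a real search the logits would be trained by gradient descent through the mixed outputs while the model weights stay fixed, which is one way to read "architecture-only search"; the abstract is truncated here, so the details of DASH's objective and discretization are not shown.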
