BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui

TL;DR
BOSCH is a training-free, black-box optimization method for choosing which attention heads in a large language model to convert to short-context sliding-window attention (SWA). It adapts both the per-layer SWA ratio and the head selection instead of relying on a static ranking, and it outperforms static heuristics.
Contribution
The paper introduces BOSCH, a black-box binary optimization approach to head selection in LLMs that addresses the limitations of static ranking methods.
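To make "black-box binary optimization" concrete, here is a minimal sketch in Python. It is not BOSCH's algorithm: it is a plain swap-based local search that only queries a score for a candidate binary vector and keeps the number of selected heads fixed. The names `toy_score`, `swap_search`, and `hidden_cost` are hypothetical, and the objective is a toy stand-in for a real evaluation of a hybridized model.

```python
import random

def toy_score(selection, hidden_cost):
    # Hypothetical black-box objective (lower is better): the total cost of
    # the heads marked 1. A real setup would evaluate the hybridized model.
    return sum(cost for bit, cost in zip(selection, hidden_cost) if bit)

def swap_search(score_fn, n_heads, n_selected, iters=200, seed=0):
    # Generic swap-based local search over binary vectors with a fixed
    # number of ones. It only calls score_fn, so it is black-box.
    rng = random.Random(seed)
    current = [1] * n_selected + [0] * (n_heads - n_selected)
    rng.shuffle(current)
    best = score_fn(current)
    for _ in range(iters):
        ones = [i for i, bit in enumerate(current) if bit]
        zeros = [i for i, bit in enumerate(current) if not bit]
        if not ones or not zeros:
            break
        i, j = rng.choice(ones), rng.choice(zeros)
        candidate = current[:]
        candidate[i], candidate[j] = 0, 1  # swap keeps the count of ones fixed
        candidate_score = score_fn(candidate)
        if candidate_score < best:
            current, best = candidate, candidate_score
    return current, best

# Assumed toy data: one cost per head, unknown to the search itself.
hidden_cost = [0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.8, 0.3]
selection, score = swap_search(
    lambda s: toy_score(s, hidden_cost), n_heads=8, n_selected=4
)
print(selection, score)
```

The search never inspects `hidden_cost` directly, which is what makes it black-box: any scoring function that accepts a binary vector could be substituted.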
Findings
BOSCH outperforms static head-level methods across multiple LLMs and SWA ratios.
It enables faster recovery of long-context performance during continual pretraining.
Head importance shows substantial turnover, highlighting the need for adaptive selection.
Abstract
Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head's local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based…
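The head-level hybridization described in the abstract can be pictured as a binary vector over the heads of a layer, where each bit decides whether a head keeps full causal attention or switches to SWA. The NumPy sketch below builds the corresponding attention masks. The function names, the window size, and the example bit vector are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def causal_mask(seq_len):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window):
    # Causal mask limited to the `window` most recent keys,
    # counting the current token itself.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def per_head_masks(use_swa, seq_len, window):
    # use_swa: binary vector, 1 = head uses SWA, 0 = head keeps full attention.
    full = causal_mask(seq_len)
    local = sliding_window_mask(seq_len, window)
    return np.stack([local if bit else full for bit in use_swa])

# Assumed example: one layer with 4 heads, 3 of them converted to SWA.
use_swa = np.array([1, 0, 1, 1])
masks = per_head_masks(use_swa, seq_len=6, window=2)
swa_ratio = use_swa.mean()  # fraction of heads in this layer using SWA
print(masks.shape, swa_ratio)
```

Under this view, a layer-level scheme corresponds to setting every bit in a layer to the same value, while a head-level scheme lets the bits differ within a layer.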
