TL;DR
Open-TQ-Metal enables long-context inference on Apple Silicon through fused compressed-domain attention, delivering a 48x attention speedup at 128K context and a 3.2x reduction in KV cache memory while matching the top-1 token predictions of FP16 inference.
Contribution
The paper introduces the first fused compressed-domain attention implementation on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac.
Findings
48x attention speedup at 128K context over the dequantize-then-attend baseline
Reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression)
Maintains identical top-1 token predictions to FP16 inference
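The 40 GB and 12.5 GB figures can be sanity-checked with simple arithmetic. The sketch below uses the public Llama 3.1 70B attention geometry (80 layers, 8 KV heads, head dimension 128) and a 128K-token context. The int4 layout is an assumption, not something the summary states: groups of 32 values, each carrying one FP16 scale and one FP16 offset. That layout happens to reproduce the reported 3.2x ratio, but the paper's actual format may differ.

```python
# Llama 3.1 70B attention geometry (public model configuration).
NUM_LAYERS = 80
NUM_KV_HEADS = 8            # grouped-query attention
HEAD_DIM = 128
CONTEXT_TOKENS = 128 * 1024

# ASSUMPTION: group-wise int4 layout, not confirmed by the source.
GROUP_SIZE = 32             # values sharing one scale/offset pair
META_BYTES_PER_GROUP = 4    # one FP16 scale + one FP16 offset

GIB = 1024 ** 3


def kv_cache_bytes(bytes_per_group: int) -> int:
    """Total bytes for keys and values across all layers and tokens."""
    values_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # K and V
    groups = values_per_token * CONTEXT_TOKENS // GROUP_SIZE
    return groups * bytes_per_group


# FP16 stores 2 bytes per value; int4 stores half a byte plus group metadata.
fp16_bytes = kv_cache_bytes(GROUP_SIZE * 2)
int4_bytes = kv_cache_bytes(GROUP_SIZE // 2 + META_BYTES_PER_GROUP)

print(f"FP16 KV cache: {fp16_bytes / GIB:.1f} GiB")
print(f"int4 KV cache: {int4_bytes / GIB:.1f} GiB")
print(f"compression:   {fp16_bytes / int4_bytes:.1f}x")
```

Under these assumptions the script reports 40.0 GiB, 12.5 GiB, and 3.2x, which matches the reported numbers if "GB" is read as GiB. Pure 4-bit storage alone would give 4x; the gap to 3.2x is the per-group metadata.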
Abstract
We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing…
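To illustrate what "computing attention directly on the compressed representation" can mean, here is a minimal pure-Python sketch. It is not the paper's Metal kernel: the function names, the group size of 32, and the asymmetric scale/offset scheme are all hypothetical choices made for illustration. The idea is that with x ≈ scale · code + offset per group, the dot product q · k splits into scale · Σ(qᵢ · codeᵢ) + offset · Σ(qᵢ), so scores can be accumulated from the 4-bit codes without building a dequantized key matrix.

```python
import math
import random

GROUP_SIZE = 32  # ASSUMPTION: hypothetical group size, not taken from the paper


def quantize_int4(vec):
    """Asymmetric per-group int4 quantization: x ~= scale * code + offset."""
    groups = []
    for start in range(0, len(vec), GROUP_SIZE):
        chunk = vec[start:start + GROUP_SIZE]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15 or 1.0  # guard against constant groups
        codes = [min(15, max(0, round((x - lo) / scale))) for x in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups):
    """Reference path only: expands codes back to floats."""
    return [scale * c + offset for codes, scale, offset in groups for c in codes]


def score_compressed(query, groups):
    """q . k evaluated on int4 codes; no dequantized key vector is built."""
    total = 0.0
    for g, (codes, scale, offset) in enumerate(groups):
        q = query[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
        code_dot = sum(qi * c for qi, c in zip(q, codes))
        total += scale * code_dot + offset * sum(q)
    return total


def attend(query, key_cache, value_cache):
    """softmax(q K^T / sqrt(d)) V over quantized keys and values."""
    d = len(query)
    scores = [score_compressed(query, k) / math.sqrt(d) for k in key_cache]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    out = [0.0] * d
    for w, groups in zip(weights, value_cache):
        i = 0
        for codes, scale, offset in groups:
            for c in codes:
                # Each value is expanded one element at a time, then discarded.
                out[i] += (w / norm) * (scale * c + offset)
                i += 1
    return out


random.seed(0)
dim, tokens = 64, 16
query = [random.gauss(0, 1) for _ in range(dim)]
key_cache = [quantize_int4([random.gauss(0, 1) for _ in range(dim)])
             for _ in range(tokens)]
value_cache = [quantize_int4([random.gauss(0, 1) for _ in range(dim)])
               for _ in range(tokens)]
fused = attend(query, key_cache, value_cache)
```

The fused result agrees with a dequantize-then-attend reference built from `dequantize` up to floating-point rounding; the difference is that the reference materializes full-precision copies of every key and value, which is the intermediate the abstract says the fused kernel eliminates.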
