Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

Sai Vegasena

arXiv:2604.16957·cs.LG·April 21, 2026

TL;DR

Open-TQ-Metal enables long-context inference on Apple Silicon through fused compressed-domain attention, delivering a 48x attention speedup at 128K context and cutting KV cache memory from 40 GB to 12.5 GB while keeping top-1 token predictions identical to FP16 inference.

Contribution

The paper presents the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac.

Findings

01  48x attention speedup at 128K context over the dequantize-then-attend baseline

02  KV cache memory reduced from 40 GB to 12.5 GB (3.2x compression)

03  Top-1 token predictions identical to FP16 inference
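
The 40 GB and 12.5 GB figures can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below uses the publicly documented Llama 3.1 70B attention geometry (80 layers, 8 KV heads, head dimension 128), which this page does not state. The int4 layout (groups of 32 values, each with one FP16 scale and one FP16 zero-point) is an assumption chosen because it yields 5 bits per value; the paper's actual layout may differ. Sizes are in binary gigabytes.

```python
# Llama 3.1 70B attention geometry, taken from the public model config
# (not stated on this page).
N_LAYERS = 80
N_KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
CONTEXT = 128 * 1024    # 128K tokens

# ASSUMPTION: 4-bit codes plus one FP16 scale and one FP16 zero-point
# per group of 32 values. The real group layout is not given here.
GROUP_SIZE = 32
CODE_BITS = 4
META_BITS_PER_GROUP = 16 + 16

GIB = 1024 ** 3


def kv_cache_bytes(bits_per_value: float) -> float:
    """Total KV cache size for one sequence at the full context length."""
    values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CONTEXT  # K and V
    return values * bits_per_value / 8


fp16_gib = kv_cache_bytes(16) / GIB
int4_bits = CODE_BITS + META_BITS_PER_GROUP / GROUP_SIZE
int4_gib = kv_cache_bytes(int4_bits) / GIB
ratio = fp16_gib / int4_gib

print(f"FP16 cache: {fp16_gib:.1f} GiB")
print(f"int4 cache: {int4_gib:.1f} GiB ({int4_bits:.1f} bits/value)")
print(f"compression: {ratio:.1f}x")
```

Under these assumptions the per-group metadata costs one extra bit per value, which is why the compression is 3.2x rather than the 4x that the raw 16-bit to 4-bit ratio would suggest.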

Abstract

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing…
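
To make "attention directly on the compressed representation" concrete, here is a minimal CPU sketch in plain Python. It is not the paper's Metal kernel. The group size, the asymmetric scale-plus-offset quantization scheme, the online-softmax accumulation and every function name are illustrative assumptions. The point it shows is that the query-key dot product can be evaluated from the 4-bit codes and per-group metadata, and the value sum accumulated on the fly, without ever building a dequantized K or V matrix.

```python
import math
import random

GROUP = 4  # hypothetical group size, kept small for readability


def quantize_int4(vec):
    """Asymmetric 4-bit quantization per group: value ~= scale * code + lo."""
    groups = []
    for g in range(0, len(vec), GROUP):
        chunk = vec[g:g + GROUP]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15 or 1.0  # constant group: any nonzero scale works
        codes = [min(15, max(0, round((x - lo) / scale))) for x in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups):
    return [scale * c + lo for codes, scale, lo in groups for c in codes]


def dot_compressed(q, groups):
    """q . k from codes: per group, scale * sum(q_i * c_i) + lo * sum(q_i)."""
    total = 0.0
    for gi, (codes, scale, lo) in enumerate(groups):
        qs = q[gi * GROUP:(gi + 1) * GROUP]
        total += scale * sum(a * c for a, c in zip(qs, codes)) + lo * sum(qs)
    return total


def fused_attention(q, k_cache, v_cache):
    """Single-query attention over quantized K/V using an online softmax."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(q)
    for k_groups, v_groups in zip(k_cache, v_cache):
        score = dot_compressed(q, k_groups) * inv_sqrt_d
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescale old partial sums
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction for a in acc]
        i = 0
        for codes, scale, lo in v_groups:  # decode V values as they are consumed
            for c in codes:
                acc[i] += weight * (scale * c + lo)
                i += 1
        running_max = new_max
    return [a / denom for a in acc]


def baseline_attention(q, k_cache, v_cache):
    """Dequantize-then-attend reference: materializes full K and V first."""
    k = [dequantize(g) for g in k_cache]
    v = [dequantize(g) for g in v_cache]
    scores = [sum(a * b for a, b in zip(q, row)) / math.sqrt(len(q)) for row in k]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * row[j] for wi, row in zip(w, v)) / z for j in range(len(q))]


random.seed(0)
D, T = 8, 16  # toy head dimension and sequence length
q = [random.gauss(0, 1) for _ in range(D)]
k_cache = [quantize_int4([random.gauss(0, 1) for _ in range(D)]) for _ in range(T)]
v_cache = [quantize_int4([random.gauss(0, 1) for _ in range(D)]) for _ in range(T)]

fused = fused_attention(q, k_cache, v_cache)
reference = baseline_attention(q, k_cache, v_cache)
max_diff = max(abs(a - b) for a, b in zip(fused, reference))
print(f"max |fused - baseline| = {max_diff:.2e}")
```

Both paths read the same quantized cache, so they agree up to floating-point rounding. What differs is the working memory: the baseline allocates full-precision copies of K and V, while the fused path keeps only a running maximum, a normalizer and one output-sized accumulator.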

Code & Models

Models

EnsueAI/metal-int4-sdpa (Hugging Face model)
