DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Bingzhou Li; Tao Huang

arXiv:2603.15685 · cs.MM · March 18, 2026

Open Access

TL;DR

DASH is a training-free token compression framework for omnimodal large language models (OmniLLMs). It segments audio-visual token sequences at semantic boundaries detected in the audio embeddings, aiming to cut inference cost while preserving accuracy.

Contribution

DASH introduces a dynamic, audio-driven segmentation method that aligns token compression with the piecewise semantic structure of audio-visual signals, and it outperforms fixed-window approaches.

Findings

01. Maintains higher accuracy at increased compression ratios.

02. Effectively aligns token segmentation with semantic boundaries.

03. Outperforms prior compression methods on multiple datasets.

Abstract

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each…
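
The boundary-detection and projection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed similarity threshold, the proportional index mapping from audio to video tokens, and all function names (`detect_boundaries`, `to_segments`, `project_boundaries`) are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_boundaries(audio_embeddings, threshold=0.5):
    """Return indices i where similarity between tokens i-1 and i drops below threshold.

    Assumption: a fixed threshold stands in for whatever discontinuity
    criterion the paper actually uses.
    """
    boundaries = []
    for i in range(1, len(audio_embeddings)):
        sim = cosine_similarity(audio_embeddings[i - 1], audio_embeddings[i])
        if sim < threshold:
            boundaries.append(i)
    return boundaries


def to_segments(num_tokens, boundaries):
    """Turn boundary indices into variable-length half-open (start, end) segments."""
    edges = [0] + list(boundaries) + [num_tokens]
    return [(s, e) for s, e in zip(edges[:-1], edges[1:]) if e > s]


def project_boundaries(boundaries, num_audio, num_video):
    """Map audio boundary indices onto the video token axis.

    Assumption: audio and video tokens cover the same time span, so a
    proportional index mapping is used.
    """
    projected = sorted({round(b * num_video / num_audio) for b in boundaries})
    return [p for p in projected if 0 < p < num_video]


# Toy example: two coherent audio "scenes" with an abrupt change at index 2.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
audio_boundaries = detect_boundaries(audio, threshold=0.5)
audio_segments = to_segments(len(audio), audio_boundaries)
video_boundaries = project_boundaries(audio_boundaries, len(audio), num_video=10)
video_segments = to_segments(10, video_boundaries)
```

In this toy input the only similarity drop occurs between the second and third audio embeddings, so both modalities are split into two segments of unequal length, in contrast to a fixed-window partition.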

Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing