DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression
Bingzhou Li, Tao Huang

TL;DR
DASH is a training-free, semantic-aware token compression framework for multimodal models that dynamically segments audio-visual data based on semantic cues, improving efficiency without sacrificing accuracy.
Contribution
DASH introduces a novel, semantic-driven, dynamic segmentation method that aligns token compression with the inherent structure of audio-visual signals, outperforming fixed-window approaches.
Findings
Maintains higher accuracy at increased compression ratios.
Effectively aligns token segmentation with semantic boundaries.
Outperforms prior compression methods on multiple datasets.
Abstract
Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each…
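The boundary-detection and projection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the fixed `threshold`, and the proportional index mapping from audio tokens to video tokens are all assumptions introduced here. The abstract only states that boundaries come from cosine-similarity discontinuities between audio embeddings and are then projected onto video tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect_boundaries(audio_emb, threshold=0.5):
    """Return indices i at which a new segment starts.

    A boundary is placed at i when the similarity between audio tokens
    i-1 and i falls below `threshold` (a hypothetical fixed cutoff; the
    paper's actual discontinuity criterion may differ).
    """
    return [
        i
        for i in range(1, len(audio_emb))
        if cosine(audio_emb[i - 1], audio_emb[i]) < threshold
    ]


def to_segments(boundaries, n_tokens):
    """Turn boundary indices into half-open (start, end) variable-length segments."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [n_tokens]
    return list(zip(starts, ends))


def project_boundaries(boundaries, n_audio, n_video):
    """Map audio boundary indices onto video token indices.

    Assumes both streams cover the same time span, so indices scale
    proportionally (an assumption made for this sketch only).
    """
    projected = sorted({(b * n_video) // n_audio for b in boundaries})
    return [p for p in projected if 0 < p < n_video]


# Toy audio embeddings: two similar tokens, a shift, two similar tokens, another shift.
audio = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
audio_bounds = detect_boundaries(audio, threshold=0.5)
audio_segments = to_segments(audio_bounds, len(audio))
video_bounds = project_boundaries(audio_bounds, n_audio=len(audio), n_video=10)
video_segments = to_segments(video_bounds, 10)
```

On the toy input, the similarity drops sharply between tokens 1 and 2 and between tokens 3 and 4, so the audio stream splits into three variable-length segments and the video stream inherits the same split points at its own token rate.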
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
