Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
Xiangyu Zhang, Benjamin John Southwell, Siqi Pan, Xinlei Niu, Beena Ahmed, Julien Epps

TL;DR
This paper introduces a novel timing-aware pre-quantization fusion method for video-enhanced audio tokenization, effectively integrating visual data while maintaining high audio reconstruction quality and improving downstream task performance.
Contribution
It identifies key factors affecting reconstruction in multimodal audio-visual tokenization and proposes a new fusion approach that preserves fidelity and enhances understanding tasks.
Findings
Fusion location impacts reconstruction quality
Contrastive learning is unsuitable for discrete tokenizers
Temporal axis fusion yields superior results
Abstract
Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding tasks due to single-modality constraints, particularly when audio signals contain ambiguous or incomplete information. While incorporating additional modality information can significantly enhance audio understanding, current multimodal fusion approaches invariably degrade reconstruction quality. This degradation is unacceptable for end-to-end audio systems that require high-fidelity audio generation capabilities. In this work, we investigate the root causes of reconstruction quality degradation in video-enhanced audio tokenization and present three key findings. First, the location of fusion within the tokenizer…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
