SGPA: Spectrogram-Guided Phonetic Alignment for Feasible Shapley Value Explanations in Multimodal Large Language Models
Paweł Pozorski, Jakub Muszyński, Maria Ganzha

TL;DR
This paper introduces SGPA, a novel method that aligns audio segments with phonetic boundaries to make Shapley value explanations of large audio language models computationally feasible and more accurate.
Contribution
SGPA combines forced alignment and spectral refinement to produce stable, word-aligned audio segments, cutting the number of model evaluations needed for Shapley value explanations.
Findings
Achieves 43× reduction in model evaluations
Significantly alters attribution concentration
Preserves global attribution profile
Abstract
Explaining the behavior of end-to-end audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over encoder frames, inflating the coalition space by roughly relative to text; individual audio frames lack standalone meaning; and token boundaries that bisect phonetic transitions introduce masking artifacts. We introduce Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline that combines Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments. Controlled diagnostics on LFM2-Audio-1.5B with VoiceBench show that SGPA yields a 43× reduction in model evaluations. Statistical testing confirms that SGPA significantly alters attribution concentration while preserving the global cumulative profile,…
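To illustrate why the number of players drives the cost of Shapley attribution, the sketch below runs a generic permutation-sampling Shapley estimator over a toy additive game, once with individual frames as players and once with frames grouped into segments. It is not the SGPA pipeline: the frame count, segment count, equal-length segments, `frame_weights`, and the additive value function are all hypothetical stand-ins for masking audio and querying a real model.

```python
import random

def permutation_shapley(n_players, value_fn, n_permutations=50, seed=0):
    """Estimate Shapley values by sampling player orderings.

    Returns (estimates, number of value_fn calls). Each permutation costs
    n_players + 1 calls, so the cost grows linearly with the player count.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_players
    evals = 0
    for _ in range(n_permutations):
        order = list(range(n_players))
        rng.shuffle(order)
        coalition = []
        prev = value_fn(coalition)  # value of the empty coalition
        evals += 1
        for player in order:
            coalition.append(player)
            cur = value_fn(coalition)
            evals += 1
            phi[player] += cur - prev  # marginal contribution of this player
            prev = cur
    return [p / n_permutations for p in phi], evals

# Hypothetical toy setup (not taken from the paper): 200 frames, 8 segments.
N_FRAMES, N_SEGMENTS = 200, 8
FRAMES_PER_SEGMENT = N_FRAMES // N_SEGMENTS
frame_weights = [((i * 7) % 5) / 10 for i in range(N_FRAMES)]

def frame_value(coalition):
    # Stand-in for "model score with only these frames unmasked".
    return sum(frame_weights[i] for i in coalition)

def segment_value(coalition):
    # Unmasking a segment unmasks every frame it contains.
    frames = [s * FRAMES_PER_SEGMENT + k
              for s in coalition for k in range(FRAMES_PER_SEGMENT)]
    return frame_value(frames)

phi_frames, evals_frames = permutation_shapley(N_FRAMES, frame_value)
phi_segments, evals_segments = permutation_shapley(N_SEGMENTS, segment_value)
print("frame-level evaluations:  ", evals_frames)
print("segment-level evaluations:", evals_segments)
```

With 50 sampled permutations, the frame-level run makes 50 × 201 = 10,050 value-function calls and the segment-level run makes 50 × 9 = 450. That ratio is a property of this toy configuration only and should not be read as the paper's reported 43× figure. Because the toy game is additive, each segment's Shapley value equals the sum of its frames' weights; a real model offers no such guarantee.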
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Speech Recognition and Synthesis · Music and Audio Processing
