SGPA: Spectrogram-Guided Phonetic Alignment for Feasible Shapley Value Explanations in Multimodal Large Language Models

Paweł Pozorski; Jakub Muszyński; Maria Ganzha

arXiv:2603.02250·cs.SD·March 4, 2026

Open Access

TL;DR

This paper introduces SGPA, a method that aligns audio segments to word and phonetic boundaries so that Shapley value explanations of audio language models become computationally feasible and avoid masking artifacts from boundaries that cut through phonetic transitions.

Contribution

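SGPA combines Connectionist Temporal Classification (CTC) forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments, cutting the number of model evaluations needed for Shapley value explanations by 43× in the reported diagnostics.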
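To illustrate why the number of players matters, here is a minimal sketch of exact Shapley attribution over a handful of word-level segments. It is not the paper's implementation: the segment names, the `toy_score` value function, and its weights are hypothetical stand-ins for masking audio segments and querying the model, and the source does not say whether SGPA uses exact or sampled Shapley values.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value_fn):
    """Exact Shapley values by enumerating every coalition (2**n value calls)."""
    n = len(players)
    cache = {}

    def v(coalition):
        # Cache so each distinct coalition is evaluated only once.
        key = frozenset(coalition)
        if key not in cache:
            cache[key] = value_fn(key)
        return cache[key]

    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for a coalition of size k without p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (v(set(subset) | {p}) - v(subset))
        phi[p] = total
    return phi, len(cache)


# ASSUMPTION: hypothetical word-level segments and a toy additive score.
# A real value function would mask the dropped segments and run the model.
segments = ["turn", "on", "the", "lights"]
toy_weights = {"turn": 0.3, "on": 0.1, "the": 0.0, "lights": 0.6}


def toy_score(kept):
    return sum(toy_weights[s] for s in kept)


phi, n_evals = exact_shapley(segments, toy_score)
print(phi)
print("value-function calls:", n_evals)
```

With four segments the enumeration needs 16 value-function calls. The count doubles with every extra player, which is why treating each encoder frame as a player is impractical and why coarser, word-aligned segments help.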

Findings

01 Achieves 43× reduction in model evaluations

02 Significantly alters attribution concentration

03 Preserves global attribution profile

Abstract

Explaining the behavior of end-to-end audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over $150$ encoder frames, inflating the coalition space by roughly $10^{42}$ relative to text; individual audio frames lack standalone meaning; and token boundaries that bisect phonetic transitions introduce masking artifacts. We introduce Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline that combines Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments. Controlled diagnostics on LFM2-Audio-1.5B with VoiceBench show that SGPA yields a $43\times$ reduction in model evaluations. Statistical testing confirms that SGPA significantly alters attribution concentration while preserving the global cumulative profile,…
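The coalition-space figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes a game with $n$ players has $2^n$ coalitions and uses a hypothetical text length of 10 tokens for the same utterance. The abstract gives only the 150-frame figure, so the text-side count is an assumption chosen for illustration.

```python
from math import log10


def coalition_count(num_players: int) -> int:
    """Number of coalitions (subsets) over num_players players: 2**n."""
    return 2 ** num_players


audio_frames = 150  # from the abstract: "over 150 encoder frames"
text_tokens = 10    # ASSUMPTION: hypothetical token count for the same utterance as text

# Ratio of the two coalition spaces, 2**150 / 2**10 = 2**140.
inflation = coalition_count(audio_frames) // coalition_count(text_tokens)
inflation_exponent = log10(inflation)
print(f"coalition space grows by about 10^{inflation_exponent:.1f}")
```

Under that assumption the ratio is $2^{140}$, which falls between $10^{42}$ and $10^{43}$ and so matches the order of magnitude quoted in the abstract.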

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Speech Recognition and Synthesis · Music and Audio Processing