Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Tyler Loakman; Joseph James; Chenghua Lin

arXiv:2511.13225·cs.CL·November 18, 2025

Open Access

TL;DR

This paper benchmarks vision-language models on their ability to interpret spectrograms and waveforms of speech, revealing their limited performance and highlighting the need for specialized knowledge in this task.

Contribution

The paper introduces a novel dataset and evaluation framework for assessing how well VLMs interpret speech spectrograms and waveforms, a task on which both zero-shot and finetuned models perform poorly.

Findings

01

VLMs rarely outperform chance in spectrogram interpretation

02

Zero-shot and finetuned models struggle without specialized knowledge

03

Highlights the gap between VLM capabilities and phonetic understanding

Abstract

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst three distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models…
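The distractor selection described in the abstract can be illustrated with a minimal sketch. The abstract states only that distractors are chosen by phonemic edit distance to the ground truth; the version below assumes the three nearest words are taken, which may differ from the authors' actual procedure. The function names, the toy lexicon, and its ARPAbet-style phoneme symbols are all hypothetical and are not taken from the paper's dataset.

```python
from typing import Dict, List, Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance computed over phoneme symbols, not characters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_distractors(target: str,
                     lexicon: Dict[str, List[str]],
                     k: int = 3) -> List[str]:
    """Return the k words closest to `target` by phonemic edit distance.

    Assumption: "closest first" is our reading of the abstract; the paper
    may apply a different selection rule. Homophones of the target are
    skipped so that no distractor shares the correct phonemic answer.
    """
    target_phones = lexicon[target]
    candidates = [
        (phoneme_edit_distance(target_phones, phones), word)
        for word, phones in lexicon.items()
        if word != target and phones != target_phones
    ]
    candidates.sort()  # by distance, then alphabetically to break ties
    return [word for _, word in candidates[:k]]


# Hypothetical toy lexicon with ARPAbet-style transcriptions.
TOY_LEXICON: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
    "scatter": ["S", "K", "AE", "T", "ER"],
}

distractors = pick_distractors("cat", TOY_LEXICON)
print(distractors)
```

With one correct transcription and three distractors per item, uniform random guessing yields 25% accuracy, which is the chance level the findings refer to.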

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Language and cultural evolution