What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban, Michael I. Mandel, Johanna Devaney

TL;DR
This paper investigates how multimodal large language models (MLLMs) reason over sound and text. It finds that an audio MLLM cannot fully leverage its LLM's text-based reasoning for audio classification, possibly because it represents the two modalities separately.
Contribution
The paper analyzes how MLLMs represent audio and text separately and highlights the resulting challenges in applying LLM reasoning to audio classification tasks.
Findings
MLLMs struggle to fully utilize LLM reasoning for audio captioning
Separate modality representations may hinder reasoning pathways
Audio and text are represented independently in MLLMs
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images; the resulting multimodal LLMs (MLLMs) are capable of describing images or sound recordings. Previous work has demonstrated that when the LLM component in MLLMs is frozen, the audio or visual encoder serves to caption the sound or image input, facilitating text-based reasoning with the LLM component. We are interested in using the LLM's reasoning capabilities in order to facilitate classification. In this paper, we demonstrate through a captioning/classification experiment that an audio MLLM cannot fully leverage its LLM's text-based reasoning when generating audio captions. We also consider how this may be due to MLLMs separately…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
