What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models

Enis Berk Çoban; Michael I. Mandel; Johanna Devaney

arXiv:2406.04615·eess.AS·June 10, 2024

TL;DR

This paper investigates the reasoning capabilities of multimodal large language models (MLLMs) involving sound and text, revealing limitations in leveraging LLM reasoning for audio classification due to separate modality representations.

Contribution

It provides an analysis of how MLLMs represent audio and text separately, highlighting challenges in using LLM reasoning for audio classification tasks.

Findings

01

MLLMs struggle to fully utilize LLM reasoning for audio captioning

02

Separate modality representations hinder reasoning pathways

03

Audio and text are represented independently in MLLMs

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images, known as multimodal LLMs (MLLMs), which are capable of describing images or sound recordings. Previous work has demonstrated that when the LLM component in MLLMs is frozen, the audio or visual encoder serves to caption the sound or image input, facilitating text-based reasoning with the LLM component. We are interested in using the LLM's reasoning capabilities in order to facilitate classification. In this paper, we demonstrate through a captioning/classification experiment that an audio MLLM cannot fully leverage its LLM's text-based reasoning when generating audio captions. We also consider how this may be due to MLLMs separately…

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling