Prompting Large Language Models with Audio for General-Purpose Speech Summarization

Wonjune Kang; Deb Roy

arXiv:2406.05968·eess.AS·September 16, 2024

TL;DR

This paper presents a novel framework that enables large language models to perform speech summarization directly from audio inputs by converting speech into token representations, allowing flexible, domain-independent summaries.

Contribution

We introduce an end-to-end system combining an instruction-tuned LLM with an audio encoder, enabling direct speech summarization without relying on intermediate transcription.

Findings

01

Outperforms a cascade baseline of speech recognition followed by text processing

02

Supports domain-independent speech summarization

03

Produces customizable summaries by varying prompts

Abstract

In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach…
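
The idea that the LLM can "process speech inputs in the same way as text" can be illustrated with a toy sketch. This is not the authors' implementation: `toy_text_embed`, `toy_audio_encode`, `build_input`, and the `STYLES` prompt wordings are all hypothetical stand-ins. The sketch only assumes that the audio encoder emits vectors of the same width as the LLM's token embeddings, so either modality can fill the same slot after an instruction, and that changing the instruction changes the summary style.

```python
from typing import Dict, List, Sequence

Vector = List[float]
DIM = 4  # toy embedding width; real LLM embeddings are far wider

# Hypothetical prompt styles; the wording is illustrative, not from the paper.
STYLES: Dict[str, str] = {
    "brief": "Summarize the following in one sentence:",
    "bullets": "List the key points of the following as bullets:",
}


def toy_text_embed(text: str) -> List[Vector]:
    """Stand-in for an LLM embedding table: one vector per whitespace token."""
    return [[float((len(word) + i) % 7) for i in range(DIM)] for word in text.split()]


def toy_audio_encode(samples: Sequence[float], frame: int = 4) -> List[Vector]:
    """Stand-in for an audio encoder: one vector per non-overlapping frame."""
    vectors = []
    for start in range(0, len(samples) - frame + 1, frame):
        mean = sum(samples[start:start + frame]) / frame
        vectors.append([mean] * DIM)
    return vectors


def build_input(style: str, content: List[Vector]) -> List[Vector]:
    """Place content embeddings (from text or audio) after the instruction."""
    return toy_text_embed(STYLES[style]) + content


# The same instruction prefix is followed by either text or audio content.
text_input = build_input("brief", toy_text_embed("the meeting moved to friday"))
audio_input = build_input("brief", toy_audio_encode([0.1] * 16))
```

In this sketch the downstream model never needs to know which modality filled the content slot. Swapping `"brief"` for `"bullets"` changes only the instruction prefix, which mirrors how the abstract describes varying the prompting strategy to obtain different summary styles.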

Code & Models

Repositories

wonjune-kang/llm-speech-summarization (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing