Prompting Large Language Models with Audio for General-Purpose Speech Summarization
Wonjune Kang, Deb Roy

TL;DR
This paper presents a novel framework that enables large language models to perform speech summarization directly from audio inputs by converting speech into token representations, allowing flexible, domain-independent summaries.
Contribution
We introduce an end-to-end system combining an instruction-tuned LLM with an audio encoder, enabling direct speech summarization without relying on intermediate transcription.
Findings
Outperforms a cascade baseline of speech recognition followed by text processing
Supports domain-independent speech summarization
Produces customizable summaries by varying prompts
Abstract
In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach…
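The data flow the abstract describes (an audio encoder emitting vectors that sit alongside text-token embeddings in the LLM input) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the embedding width `EMBED_DIM`, the average-pool-and-project encoder, and the helper names `audio_encoder`, `embed_text`, and `build_llm_input` are all hypothetical.

```python
import random

EMBED_DIM = 8  # hypothetical width of the LLM's token embeddings


def make_matrix(rows, cols, rng):
    """Random matrix as a list of rows; stands in for learned weights or features."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def audio_encoder(frames, weights, stride=2):
    """Toy encoder (assumption): average every `stride` frames, then project
    the pooled feature vector to EMBED_DIM with a linear map."""
    embeddings = []
    for start in range(0, len(frames) - stride + 1, stride):
        window = frames[start:start + stride]
        pooled = [sum(column) / stride for column in zip(*window)]
        embeddings.append([sum(p * w for p, w in zip(pooled, row)) for row in weights])
    return embeddings


def embed_text(token_ids, table):
    """Look up one embedding row per text token id."""
    return [table[token_id] for token_id in token_ids]


def build_llm_input(prompt_ids, speech_frames, table, encoder_weights):
    """Concatenate prompt embeddings with speech embeddings so a downstream
    model would see a single sequence of EMBED_DIM-sized vectors."""
    return embed_text(prompt_ids, table) + audio_encoder(speech_frames, encoder_weights)


rng = random.Random(0)
table = make_matrix(10, EMBED_DIM, rng)            # 10-token toy vocabulary
encoder_weights = make_matrix(EMBED_DIM, 4, rng)   # projects 4-dim audio features
speech_frames = make_matrix(6, 4, rng)             # 6 frames of 4-dim features
sequence = build_llm_input([1, 2, 3], speech_frames, table, encoder_weights)
```

In this sketch the six audio frames are pooled into three vectors, so the combined sequence holds three prompt embeddings followed by three speech embeddings, all of the same width. Because the prompt is ordinary text, changing the summary style only requires changing `prompt_ids`, while the speech portion is left unchanged.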
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
