Self-Powered LLM Modality Expansion for Large Speech-Text Models

Tengfei Yu; Xuebo Liu; Zhiyi Hou; Liang Ding; Dacheng Tao; Min Zhang

arXiv:2410.03798 · cs.CL · October 15, 2024

TL;DR

This paper introduces a self-powered large speech-text model that reduces speech anchor bias and enhances multimodal integration by leveraging model-generated speech recognition data for instruction tuning.

Contribution

It proposes a novel self-powered approach that uses augmented speech data from the model itself to improve instruction tuning and mitigate bias in large speech-text models.
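
The data-construction idea summarized above can be sketched roughly as follows. This is a minimal illustration of one possible reading of the summary, not the authors' pipeline: the names `AsrPair`, `TuningExample`, `build_self_powered_data`, and `toy_generate` are hypothetical, and the `generate` callable stands in for whatever text-only LLM call would be used in practice.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AsrPair:
    audio_path: str   # path to a speech clip (hypothetical field name)
    transcript: str   # reference transcript for that clip


@dataclass
class TuningExample:
    audio_path: str
    instruction: str
    response: str


def build_self_powered_data(
    pairs: Iterable[AsrPair],
    instructions: List[str],
    generate: Callable[[str], str],
) -> List[TuningExample]:
    """Pair each speech clip with text instructions and model-written responses."""
    examples = []
    for pair in pairs:
        for instruction in instructions:
            # Assumption: the text-only model sees the transcript, never the audio.
            prompt = f"{instruction}\n\n{pair.transcript}"
            response = generate(prompt)
            # The tuning example keeps the audio in place of the transcript.
            examples.append(TuningExample(pair.audio_path, instruction, response))
    return examples


def toy_generate(prompt: str) -> str:
    # Stand-in for a real LLM call: returns the last prompt line in upper case.
    return prompt.splitlines()[-1].upper()


data = build_self_powered_data(
    [AsrPair("clip_0.wav", "hello world")],
    ["Repeat the sentence.", "Translate the sentence into French."],
    toy_generate,
)
```

Varying the instruction for the same clip is the point of the sketch: a model tuned on such examples would have to read the text instruction rather than treat the speech itself as the directive.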

Findings

01. Mitigates speech anchor bias effectively
02. Improves speech-text modality fusion
03. Enhances instruction-following performance

Abstract

Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and multimodal data instruction-tuning offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs, identifying a critical issue termed speech anchor bias: a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives and thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic…

Code & Models

Repositories

ytf-philp/self-powered-lsm
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis