A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Nicolas Calbucura; Jose Guillen; Valentin Barriere

arXiv:2512.07571 · cs.CL · April 7, 2026

TL;DR

This paper introduces a straightforward approach to augmenting pre-trained language models with speech tokens for classification. A lasso-based feature selection step keeps only the audio tokens most important for the task, which improves performance.

Contribution

The method combines speech tokenization, lasso-based feature selection, and a self-supervised language modeling adaptation step to add speech information to a language model at low cost.

Findings

01 Improved classification performance over unimodal models and larger SpeechLMs.

02 Even random audio token selection can enhance unimodal models.

03 Effective on Argumentative Fallacy Detection and affective computing tasks.

Abstract

This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classic difficulty in fusing audio embeddings with text is that the audio sequence is much longer than the text sequence. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate into a large language model at low cost. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task, and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show this helps to improve the performances…
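
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an L1-penalized (lasso-style) logistic regression trained by proximal gradient descent as a stand-in for their selection procedure, and the toy corpus, the token names such as `<aud_1>`, and all function names and hyperparameters are hypothetical. It builds a joint Bag-of-Words over text words and audio tokens, fits the sparse classifier, and keeps the audio tokens whose weights stay non-zero.

```python
import math
from collections import Counter

# Toy corpus: (text words, audio tokens, label). All tokens are made up.
EXAMPLES = [
    (["yes"], ["<aud_1>", "<aud_3>", "<aud_3>"], 1),
    (["yes"], ["<aud_1>", "<aud_3>"], 1),
    (["no"], ["<aud_2>", "<aud_3>"], 0),
    (["no"], ["<aud_2>", "<aud_3>", "<aud_3>"], 0),
]


def bag_of_words(examples):
    """Build one count vector per example over the joint text+audio vocabulary."""
    docs = [text + audio for text, audio, _ in examples]
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([float(counts[tok]) for tok in vocab])
    return vocab, rows


def soft_threshold(value, amount):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def l1_logistic_regression(X, y, l1=0.1, lr=0.1, steps=500):
    """Fit L1-penalized logistic regression with proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b


vocab, X = bag_of_words(EXAMPLES)
labels = [label for _, _, label in EXAMPLES]
weights, bias = l1_logistic_regression(X, labels)

# Keep only the audio tokens that the sparse model still uses.
selected = [
    tok for tok, wj in zip(vocab, weights)
    if tok.startswith("<aud_") and wj != 0.0
]
print(selected)
```

In this toy setup `<aud_3>` occurs equally often in both classes, so the L1 penalty keeps its weight at zero and it is dropped, while `<aud_1>` and `<aud_2>` are retained. In a real pipeline the retained tokens would presumably be the only audio tokens passed to the language model, which keeps the added sequence length and vocabulary small.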

Code & Models

Repositories

salocinc/EACL26SpeechTokFallacy (GitHub)