Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation

Habibeh Naderi; Behrouz Haji Soleimani; Stan Matwin

arXiv:2604.16615 · cs.LG · April 21, 2026

TL;DR

CoCo-LoRA is a multimodal, uncertainty-aware parameter-efficient fine-tuning (PEFT) method that incorporates audio context to improve reliability on text prediction tasks in speech-centered applications.

Contribution

It introduces a Bayesian PEFT approach that models audio-driven uncertainty without high-dimensional feature fusion, making multimodal predictions more robust.
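To make "without high-dimensional fusion" concrete, the following sketch counts the extra trainable parameters (ignoring biases) of two hypothetical designs. The first is an illustrative feature-fusion baseline in which every adapted layer projects the full audio embedding into the model width. The second follows the shared-context idea from the abstract, assuming (because the abstract is truncated at that point) that one audio-to-context projection is reused by every layer and that each layer maps its rank-space features plus the context to a mean and a log-variance. The function names, dimensions, and both designs are assumptions for illustration, not the authors' architecture or reported sizes.

```python
def fusion_extra_params(n_layers, d_model, d_audio):
    """Hypothetical fusion baseline: each adapted layer projects the full
    audio embedding (d_audio) into the model width (d_model)."""
    return n_layers * d_audio * d_model


def shared_context_extra_params(n_layers, rank, d_audio, d_ctx):
    """Hypothetical shared-context design: a single audio -> context projection,
    plus per-layer mean and log-variance heads that map
    [rank-space features, context] into the rank-r space."""
    shared = d_audio * d_ctx                 # projected once, shared by all layers
    per_layer = 2 * rank * (rank + d_ctx)    # mean head + log-variance head
    return shared + n_layers * per_layer


# Illustrative sizes only; they are not taken from the paper.
dims = dict(n_layers=24, d_audio=768)
fusion = fusion_extra_params(d_model=1024, **dims)
shared = shared_context_extra_params(rank=8, d_ctx=32, **dims)
print(f"fusion: {fusion:,}  shared context: {shared:,}")
```

With these assumed sizes the fusion design adds 18,874,368 parameters and the shared-context design adds 39,936. The gap comes from never routing the full audio embedding through each layer: audio enters once, and every per-layer cost scales with the adapter rank and the small context dimension.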

Findings

1. CoCo-LoRA outperforms text-only PEFT and feature-fusion baselines on diverse tasks.
2. It effectively models heteroscedastic uncertainty driven by audio context.
3. The method maintains scalability while incorporating external acoustic information.

Abstract

We introduce CoCo-LoRA, a multimodal, uncertainty-aware parameter-efficient fine-tuning method for text prediction tasks accompanied by audio context. Existing PEFT approaches such as LoRA are efficient but typically deterministic, while recent Bayesian low-rank adapters model uncertainty in a lightweight way yet remain largely unimodal and condition uncertainty primarily on internal text features. This leaves them poorly equipped to reflect uncertainty driven by external acoustic factors such as background noise, channel variability, or speaking style, which can materially affect reliability in speech-centered applications. CoCo-LoRA addresses this gap by conditioning a contextual variational posterior in the low-rank space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then…
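The mechanism outlined in the abstract can be sketched as a small NumPy forward pass: a frozen linear layer plus a stochastic low-rank update whose rank-space posterior is conditioned on both the text-derived adapter features and an audio context vector. Everything beyond that outline is an assumption made for illustration: mean-pooling over audio frames, the tanh projection into the context space, a diagonal Gaussian posterior with a residual mean, a standard-normal prior for the KL term, and all class, function, and parameter names (`ContextualLoRASketch`, `mc_predict`, and so on). It is not the authors' implementation.

```python
import numpy as np


class ContextualLoRASketch:
    """Hypothetical contextual Bayesian LoRA layer (illustrative, not the paper's code)."""

    def __init__(self, d_in, d_out, rank, d_audio, d_ctx, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.normal(0.0, 0.1, size=(d_out, d_in))    # frozen base weight
        self.A = rng.normal(0.0, 0.1, size=(rank, d_in))      # low-rank down-projection
        self.B = rng.normal(0.0, 0.1, size=(d_out, rank))     # low-rank up-projection
        self.P = rng.normal(0.0, 0.1, size=(d_ctx, d_audio))  # audio -> shared context
        # Posterior heads read [rank-space text features, audio context].
        self.W_mu = rng.normal(0.0, 0.1, size=(rank, rank + d_ctx))
        self.W_logvar = rng.normal(0.0, 0.1, size=(rank, rank + d_ctx))

    def context(self, audio_frames):
        """Mean-pool audio frames (T, d_audio) and project once into the context space."""
        pooled = audio_frames.mean(axis=0)
        return np.tanh(self.P @ pooled)

    def posterior(self, x, c):
        """Diagonal Gaussian q(z | A x, c) in the rank-r space."""
        h = self.A @ x                          # local text-derived adapter features
        feats = np.concatenate([h, c])
        mu = h + self.W_mu @ feats              # residual mean around the deterministic LoRA path
        logvar = self.W_logvar @ feats          # input- and context-dependent variance
        return mu, logvar

    def forward(self, x, c, rng):
        """One stochastic pass: W0 x + B z, with z sampled via reparameterization."""
        mu, logvar = self.posterior(x, c)
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        return self.W0 @ x + self.B @ z

    def kl_to_std_normal(self, x, c):
        """KL(q || N(0, I)) for the diagonal Gaussian posterior."""
        mu, logvar = self.posterior(x, c)
        return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))


def mc_predict(layer, x, audio_frames, n_samples=256, seed=0):
    """Monte Carlo predictive mean and variance for one input and its audio."""
    rng = np.random.default_rng(seed)
    c = layer.context(audio_frames)             # computed once, reused for every sample
    ys = np.stack([layer.forward(x, c, rng) for _ in range(n_samples)])
    return ys.mean(axis=0), ys.var(axis=0)


layer = ContextualLoRASketch(d_in=8, d_out=4, rank=2, d_audio=6, d_ctx=3)
data_rng = np.random.default_rng(1)
x = data_rng.standard_normal(8)                 # stand-in for a text hidden state
clean = data_rng.standard_normal((20, 6))       # stand-in for 20 audio frames
noisy = clean + 3.0 * data_rng.standard_normal((20, 6))
mean_clean, var_clean = mc_predict(layer, x, clean)
mean_noisy, var_noisy = mc_predict(layer, x, noisy)
```

Because the log-variance depends on the audio context `c`, the same text input paired with different acoustics (here, synthetic clean versus noisy frames) receives a different predictive variance from `mc_predict`, which is the kind of heteroscedastic, audio-driven uncertainty the paper targets. During training, a KL term like `kl_to_std_normal` would typically be added to the task loss, as in standard variational inference.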