Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

TL;DR
CoCo-LoRA is a multimodal, uncertainty-aware parameter-efficient fine-tuning (PEFT) method that incorporates audio context to improve reliability in speech-centered prediction tasks.
Contribution
It introduces a Bayesian PEFT approach that models audio-driven uncertainty without high-dimensional feature fusion, improving the robustness of multimodal predictions.
Findings
CoCo-LoRA outperforms text-only PEFT and feature-fusion baselines on diverse tasks.
It effectively models heteroscedastic uncertainty driven by audio context.
The method maintains scalability while incorporating external acoustic information.
Abstract
We introduce CoCo-LoRA, a multimodal, uncertainty-aware parameter-efficient fine-tuning method for text prediction tasks accompanied by audio context. Existing PEFT approaches such as LoRA are efficient but typically deterministic, while recent Bayesian low-rank adapters model uncertainty in a lightweight way yet remain largely unimodal and condition uncertainty primarily on internal text features. This leaves them poorly equipped to reflect uncertainty driven by external acoustic factors such as background noise, channel variability, or speaking style, which can materially affect reliability in speech-centered applications. CoCo-LoRA addresses this gap by conditioning a contextual variational posterior in the low-rank space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then…
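The abstract is cut off before the exact parameterization is given, so the following is only a minimal NumPy sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the following. The pooled audio embedding is projected once by a matrix `P` into a shared context vector `c`, which every adapter layer reuses. Each layer forms a Gaussian posterior over a rank-r latent `z` whose mean is the text-derived adapter feature `h = x A` shifted by `c W_mu`, and whose log-variance depends on both `h` and `c`, which makes the uncertainty heteroscedastic and audio-driven. The KL term is taken against a standard normal prior. All names (`project_context`, `contextual_lora_layer`, `W_mu`, `W_logvar`, `b_logvar`) and the N(0, I) prior are hypothetical choices made for illustration.

```python
import numpy as np


def project_context(audio_emb, P):
    """Project a pooled audio embedding into the shared context space (done once)."""
    return audio_emb @ P  # (batch, d_ctx)


def contextual_lora_layer(x, c, layer, rng, n_samples=1):
    """Hypothetical context-conditioned Bayesian LoRA layer (illustrative sketch).

    x: (batch, d_in) text-side input; c: (batch, d_ctx) shared audio context.
    Returns sampled outputs (n_samples, batch, d_out) and a KL term vs. N(0, I).
    """
    h = x @ layer["A"]                                    # local text-derived adapter features, (batch, r)
    mu = h + c @ layer["W_mu"]                            # posterior mean shifted by the audio context
    hc = np.concatenate([h, c], axis=-1)                  # (batch, r + d_ctx)
    log_var = hc @ layer["W_logvar"] + layer["b_logvar"]  # input-dependent (heteroscedastic) log-variance
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples,) + mu.shape)    # reparameterization noise
    z = mu + sigma * eps                                  # samples in the low-rank space
    y = x @ layer["W0"] + z @ layer["B"]                  # frozen base path + low-rank update
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over rank dims, averaged over batch.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - log_var, axis=-1).mean()
    return y, kl


def init_layer(rng, d_in, d_out, r, d_ctx):
    """Random stand-in parameters; W0 plays the role of a frozen pretrained weight."""
    return {
        "W0": rng.standard_normal((d_in, d_out)) / np.sqrt(d_in),
        "A": rng.standard_normal((d_in, r)) * 0.1,
        "B": rng.standard_normal((r, d_out)) * 0.1,
        "W_mu": rng.standard_normal((d_ctx, r)) * 0.1,
        "W_logvar": rng.standard_normal((r + d_ctx, r)) * 0.1,
        "b_logvar": np.full(r, -2.0),
    }


rng = np.random.default_rng(0)
batch, d_model, r, d_audio, d_ctx = 2, 16, 4, 32, 8
x = rng.standard_normal((batch, d_model))
audio = rng.standard_normal((batch, d_audio))
P = rng.standard_normal((d_audio, d_ctx)) / np.sqrt(d_audio)

c = project_context(audio, P)  # computed once, reused by every adapter layer
layers = [init_layer(rng, d_model, d_model, r, d_ctx) for _ in range(2)]
total_kl = 0.0
for layer in layers:
    ys, kl = contextual_lora_layer(x, c, layer, rng, n_samples=8)
    x = ys.mean(axis=0)  # simplification: pass the Monte Carlo mean to the next layer
    total_kl += kl
print(x.shape, total_kl)
```

Because `c` is a single low-dimensional vector per example that only modulates the rank-r posterior, the audio signal never has to be concatenated with full hidden states. This is one plausible way to read "without high-dimensional fusion". The spread of `ys` across samples gives a per-example predictive uncertainty that can change with the audio input.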
