TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

Seongah Kim; Dinh Phu Tran; Hyeontaek Hwang; Saad Wazir; Duc Do Minh; Daeyoung Kim

arXiv:2605.11572·cs.CV·May 14, 2026

TL;DR

This paper introduces TB-AVA, a parameter-efficient framework that uses text as a semantic bridge between audio and visual streams, achieving state-of-the-art results on the AVE, AVS, and AVVP benchmarks.

Contribution

It presents a Text-Bridged Audio-Visual Adapter (TB-AVA) whose Gated Semantic Modulation (GSM) selectively modulates feature channels, enabling cross-modal alignment on top of frozen audio and visual encoders.

Findings

01 Achieves state-of-the-art performance on AVE, AVS, and AVVP benchmarks.

02 Demonstrates effective use of text as a semantic anchor in audio-visual tasks.

Abstract

Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on the Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where it achieves state-of-the-art performance, demonstrating text as an effective semantic anchor…
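
The abstract does not say how GSM computes its gates, so the following is only a rough illustration of the general idea of text-conditioned channel gating, not the authors' implementation. It assumes a sigmoid gate obtained from a linear projection of a pooled text embedding; the function name `gated_channel_modulation`, the parameters `weight` and `bias`, and all shapes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_channel_modulation(features, text_embedding, weight, bias):
    """Scale each feature channel by a gate computed from a text embedding.

    Assumed shapes (hypothetical, not taken from the paper):
      features:       (num_tokens, num_channels) tokens from a frozen encoder
      text_embedding: (text_dim,) pooled text representation
      weight:         (text_dim, num_channels) trainable projection
      bias:           (num_channels,) trainable offset
    """
    # One gate per channel, inferred from the text alone.
    gate = sigmoid(text_embedding @ weight + bias)
    # Broadcast the per-channel gate across all tokens.
    return features * gate


# Toy inputs standing in for encoder outputs.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
text_embedding = rng.normal(size=(16,))
weight = rng.normal(size=(16, 8)) * 0.1
bias = np.zeros(8)

modulated = gated_channel_modulation(features, text_embedding, weight, bias)
```

In this sketch only `weight` and `bias` would be trained while the encoders stay frozen, which is what makes such an adapter parameter-efficient. A gate near 0 suppresses a channel and a gate near 1 passes it through unchanged.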
