RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Subrata Biswas; Mohammad Nur Hossain Khan; Bashima Islam

arXiv:2505.17114 · cs.CL · September 8, 2025

TL;DR

RAVEN introduces a query-guided multimodal QA model that identifies which audio, video, and sensor signals are relevant to a question, improving accuracy and robustness in multimodal reasoning tasks.

Contribution

The paper presents RAVEN, a novel architecture with query-conditioned gating and a three-stage training pipeline, along with a new AVS-QA dataset for multimodal question answering.

Findings

01 Achieves up to 14.5% accuracy improvement over state-of-the-art models.

02 Incorporating sensor data boosts performance by 16.4%.

03 Remains robust under modality corruption, outperforming baselines by 50.23%.

Abstract

Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and…
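
The gating idea in the abstract (a scalar relevance score per token, applied before fusion) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' QuART implementation: the scoring rule (sigmoid of a scaled dot product between a query vector and each token), the mean-pooling fusion, and the names `relevance_score`, `gate_tokens`, and `fuse` are all hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def relevance_score(query: Vector, token: Vector) -> float:
    """Scalar relevance in (0, 1).

    Assumed scoring rule: sigmoid of the scaled dot product between the
    query vector and the token. The paper's actual scoring may differ.
    """
    logit = sum(q * t for q, t in zip(query, token)) / math.sqrt(len(query))
    return 1.0 / (1.0 + math.exp(-logit))


def gate_tokens(query: Vector, streams: Dict[str, List[Vector]]) -> Dict[str, List[Vector]]:
    """Scale every token of every modality by its query-conditioned score."""
    return {
        name: [[relevance_score(query, tok) * x for x in tok] for tok in tokens]
        for name, tokens in streams.items()
    }


def fuse(gated: Dict[str, List[Vector]]) -> Vector:
    """Toy fusion (an assumption): mean over all gated tokens."""
    all_tokens = [tok for tokens in gated.values() for tok in tokens]
    dim = len(all_tokens[0])
    return [sum(tok[i] for tok in all_tokens) / len(all_tokens) for i in range(dim)]


# Hypothetical 2-dimensional tokens: the video token aligns with the query,
# the audio token opposes it (a distractor), the sensor token is orthogonal.
query = [1.0, 0.0]
streams = {
    "video": [[4.0, 0.0]],
    "audio": [[-4.0, 0.0]],
    "sensor": [[0.0, 1.0]],
}
gated = gate_tokens(query, streams)
fused = fuse(gated)
```

In this toy setup the aligned video token keeps most of its magnitude, the opposing audio token is largely suppressed, and the orthogonal sensor token is halved, which is the amplify-and-suppress behavior the abstract describes. A learned model would replace the fixed dot product with trained projections.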

Code & Models

Repositories

bashlab/raven (PyTorch, official)

Models

BASH-Lab/RAVEN-AV-7B (model, 2 downloads)

Videos

RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language