TL;DR
RAVEN is a query-guided multimodal QA model that identifies which audio, video, and sensor signals are relevant to a question, improving accuracy and robustness in multimodal reasoning tasks.
Contribution
The paper presents RAVEN, a novel architecture with query-conditioned gating and a three-stage training pipeline, along with a new AVS-QA dataset for multimodal question answering.
Findings
Achieves up to 14.5% accuracy improvement over state-of-the-art models.
Incorporating sensor data boosts performance by 16.4%.
Remains robust under modality corruption, outperforming baselines by 50.23%.
Abstract
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch, respectively. To support training and…
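The query-conditioned gating idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's QuART module: the bilinear scoring form, the sigmoid squashing, the identity weight matrix `w`, and the names `query_gate`, `video`, `audio`, and `sensor` are all assumptions made for illustration. It only shows the general pattern of assigning one scalar relevance score per token, conditioned on the query, and rescaling tokens before fusion.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function mapping logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def query_gate(tokens, query, w):
    """Scale each token by a scalar relevance score conditioned on the query.

    tokens: (n_tokens, d) array of token embeddings from all modalities.
    query:  (d,) array, a pooled embedding of the question.
    w:      (d, d) projection; learned in a real model, hypothetical here.
    Returns (gated_tokens, scores).
    """
    d = tokens.shape[-1]
    # Assumed bilinear score: one logit per token, scaled by sqrt(d).
    logits = tokens @ w @ query / np.sqrt(d)
    scores = sigmoid(logits)
    # Broadcast the per-token scalar over the embedding dimension.
    return tokens * scores[:, None], scores


rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(4, d))   # stand-in video tokens
audio = rng.normal(size=(3, d))   # stand-in audio tokens
sensor = rng.normal(size=(5, d))  # stand-in sensor tokens
query = rng.normal(size=d)
w = np.eye(d)  # placeholder weights; an assumption, not a trained value

# Concatenate all modalities so every token is scored against the same query.
tokens = np.concatenate([video, audio, sensor], axis=0)
gated, scores = query_gate(tokens, query, w)
```

With the placeholder identity weights, a token aligned with the query receives a score above 0.5 and a token pointing the opposite way receives a score below 0.5, which is the amplify-or-suppress behavior the abstract describes in general terms.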
