Audio-visual training for improved grounding in video-text LLMs
Shivprasad Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar SA

TL;DR
This paper introduces an audio-visual model architecture for video-text understanding and shows that training on audio data improves the grounding of responses. It also releases a human-annotated benchmark dataset of audio-aware question-answer pairs.
Contribution
The paper presents a novel audio-visual training approach for video-language models and releases a new dataset for evaluating audio-aware question-answering.
Findings
Training on audio-visual data improves grounding accuracy.
Audio data enhances video understanding compared to vision-only models.
A new benchmark dataset enables better evaluation of audio-visual models.
Abstract
Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.
Taxonomy
Topics: Multimedia Communication and Technology · Video Analysis and Summarization
