Listen Then See: Video Alignment with Speaker Attention

Aviral Agrawal; Carlos Mateo Samudio Lezcano; Iqui Balam Heredia-Marin; Prabhdeep Singh Sethi (Carnegie Mellon University)

arXiv:2404.13530 · cs.CV · April 23, 2024

TL;DR

This paper introduces a cross-modal alignment method for video question answering that uses audio to connect the video and language modalities, reaching state-of-the-art accuracy (82.06%) on the Social IQ 2.0 dataset.

Contribution

It proposes an approach that uses audio as a bridge to align and fuse the video and language modalities in Socially Intelligent Question Answering (SIQA), addressing the dominance of the text modality over the others.
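To make the "audio as a bridge" idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's architecture, which this summary does not specify. The function names (`attend`, `audio_bridged_fusion`), the two-step attention order (text over audio, then audio over video), and fusion by concatenation are all assumptions chosen for clarity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the weighted average of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def audio_bridged_fusion(text_vec, audio_vecs, video_vecs):
    """Hypothetical fusion: text -> audio -> video, then concatenate.

    Assumes all feature vectors share one dimensionality.
    """
    # Step 1 (assumed): the text query selects relevant audio segments.
    audio_ctx, audio_w = attend(text_vec, audio_vecs, audio_vecs)
    # Step 2 (assumed): the audio context, not the text, selects video frames,
    # so video is reached only through the audio "bridge".
    video_ctx, video_w = attend(audio_ctx, video_vecs, video_vecs)
    # Step 3 (assumed): fuse by simple concatenation.
    fused = text_vec + audio_ctx + video_ctx
    return fused, audio_w, video_w


# Toy features, invented for illustration.
text_vec = [1.0, 0.0]
audio_vecs = [[2.0, 0.0], [0.0, 2.0]]
video_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, audio_w, video_w = audio_bridged_fusion(text_vec, audio_vecs, video_vecs)
```

In this sketch the video frames are never scored against the text directly. That is one simple way to keep a dominant text signal from bypassing the video stream, which is the failure mode the contribution describes.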

Findings

01 Achieved 82.06% accuracy on the Social IQ 2.0 dataset.

02 Improved utilization of the video modality through audio-bridge alignment.

03 Reduced language overfitting and video bypassing in multimodal fusion.

Abstract

Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, but in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality. This…

Code & Models

Repositories

sts-vlcc/sts-vlcc (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media