Towards Multilingual Audio-Visual Question Answering

Orchid Chetia Phukan; Priyabrata Mallick; Swarup Ranjan Behera; Aalekhya Satya Narayani; Arun Balaji Buduru; Rajesh Sharma

arXiv:2406.09156·cs.LG·June 14, 2024

TL;DR

This paper introduces a scalable approach to multilingual audio-visual question answering by leveraging machine translation and foundation models, creating new datasets and benchmarks for eight languages.

Contribution

It presents the MERA framework and datasets for multilingual AVQA, reducing manual annotation and enabling future research in this area.

Findings

01. Created eight-language AVQA datasets using machine translation.

02. Proposed the MERA framework with multiple model architectures.

03. Established benchmarks for multilingual AVQA.

Abstract

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly focused on English, and replicating it for other languages requires substantial resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets covering eight languages, created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. We then propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference…
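The dataset-creation idea in the abstract (translating the questions and answers of an existing English AVQA benchmark rather than annotating each language by hand) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `QAItem` record layout, the `translate_dataset` helper, the `tag_translator` stand-in, and the language codes are hypothetical and are not taken from the paper or its repository. A real pipeline would plug an actual machine translation system into the `translate` callable.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class QAItem:
    """One question-answer pair tied to a video clip (hypothetical layout)."""
    video_id: str
    question: str
    answer: str
    lang: str = "en"


# A translator takes (text, target_language_code) and returns translated text.
Translator = Callable[[str, str], str]


def translate_dataset(
    items: Sequence[QAItem],
    target_langs: Sequence[str],
    translate: Translator,
) -> Dict[str, List[QAItem]]:
    """Build one translated copy of the QA set per target language.

    The audio-visual content is language independent, so only the text
    fields change; video_id is carried over untouched.
    """
    translated: Dict[str, List[QAItem]] = {}
    for lang in target_langs:
        translated[lang] = [
            replace(
                item,
                question=translate(item.question, lang),
                answer=translate(item.answer, lang),
                lang=lang,
            )
            for item in items
        ]
    return translated


def tag_translator(text: str, lang: str) -> str:
    """Stand-in for a real MT system: just tags text with the language code."""
    return f"[{lang}] {text}"


if __name__ == "__main__":
    english = [QAItem("v1", "What instrument is playing?", "guitar")]
    # Language codes below are placeholders, not the paper's language list.
    multilingual = translate_dataset(english, ["fr", "de"], tag_translator)
    for lang, rows in multilingual.items():
        print(lang, rows[0].question, "->", rows[0].answer)
```

Because the translator is passed in as a callable, the same loop works with any translation backend, and the stand-in makes the data flow easy to check before a real system is attached.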

Code & Models

Repositories

swarupbehera/mAVQA (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media