Towards Multilingual Audio-Visual Question Answering
Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

TL;DR
This paper introduces a scalable approach to multilingual audio-visual question answering by leveraging machine translation and foundation models, creating new datasets and benchmarks for eight languages.
Contribution
It presents the MERA framework and datasets for multilingual AVQA, reducing manual annotation and enabling future research in this area.
Findings
Created eight-language AVQA datasets using machine translation.
Proposed MERA framework with multiple model architectures.
Established benchmarks for multilingual AVQA.
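The translation-based dataset construction summarized above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `QAItem` fields, the `expand_to_languages` helper, and the `stub_translate` placeholder are all hypothetical, and the language codes are arbitrary examples rather than the paper's eight languages. The sketch assumes only the question and answer text is translated, while each item keeps pointing at the same video.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


# Hypothetical record layout; the actual datasets may store different fields.
@dataclass(frozen=True)
class QAItem:
    video_id: str
    question: str
    answer: str
    lang: str = "en"


# A translator maps (text, target language code) to translated text.
Translator = Callable[[str, str], str]


def expand_to_languages(
    items: List[QAItem],
    targets: List[str],
    translate: Translator,
) -> Dict[str, List[QAItem]]:
    """Return one translated copy of the QA set per target language.

    The video reference is left untouched; only the text fields change.
    """
    out: Dict[str, List[QAItem]] = {}
    for lang in targets:
        out[lang] = [
            replace(
                item,
                question=translate(item.question, lang),
                answer=translate(item.answer, lang),
                lang=lang,
            )
            for item in items
        ]
    return out


def stub_translate(text: str, lang: str) -> str:
    # Placeholder (assumption): tags the text instead of translating it.
    # A real setup would call a machine translation model or service here.
    return f"[{lang}] {text}"


if __name__ == "__main__":
    english = [QAItem("v1", "What instrument is playing?", "guitar")]
    multilingual = expand_to_languages(english, ["fr", "de"], stub_translate)
    for lang, rows in multilingual.items():
        print(lang, rows[0].question, "->", rows[0].answer)
```

Passing the translator in as a function keeps the sketch independent of any particular translation system, so the stub can be swapped for a real model without touching the expansion logic.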
Abstract
In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it for other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets covering eight languages, created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media
