Spoken question answering for visual queries

Nimrod Shabtay; Zvi Kons; Avihu Dekel; Hagai Aronowitz; Ron Hoory; Assaf Arbelle

arXiv:2505.23308·eess.AS·May 30, 2025

Open Access

TL;DR

This paper introduces a multi-modal system for spoken visual question answering that fuses speech, text, and images, and demonstrates that synthesized speech can effectively train such models.

Contribution

It presents the first approach to spoken VQA trained on synthesized speech, removing the need for a dedicated dataset that covers all three modalities (text, speech, and images).

Findings

01. Synthesized speech enables effective training of spoken VQA models.

02. A model trained only on synthesized speech nearly matches the performance of the upper-bound model trained on textual QA.

03. The choice of TTS model has minimal impact on accuracy.

Abstract

Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively. This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images. Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs.…
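
The dataset synthesis step described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record types, the `synthesize_dataset` helper, and the `placeholder_tts` function are all hypothetical names, and the placeholder emits a sine tone instead of calling a real zero-shot TTS model. It also assumes every question is rendered once per TTS model, which the abstract does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class VQAExample:
    """One textual VQA record (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


@dataclass
class SVQAExample:
    """A VQA record extended with a synthesized spoken question."""
    image_id: str
    question_text: str
    question_audio: List[float]  # mono waveform samples
    tts_name: str                # which TTS model produced the audio
    answer: str


# Assumption: a TTS backend is any callable that maps text to a waveform.
TTSFn = Callable[[str], List[float]]


def placeholder_tts(text: str, sample_rate: int = 16000) -> List[float]:
    """Stand-in for a real TTS model: 10 ms of a 220 Hz tone per character."""
    n_samples = len(text) * sample_rate // 100
    return [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(n_samples)]


def synthesize_dataset(
    examples: Sequence[VQAExample],
    tts_models: Dict[str, TTSFn],
) -> List[SVQAExample]:
    """Pair each textual question with one utterance per TTS model."""
    spoken: List[SVQAExample] = []
    for ex in examples:
        for name, tts in tts_models.items():
            spoken.append(
                SVQAExample(
                    image_id=ex.image_id,
                    question_text=ex.question,
                    question_audio=tts(ex.question),
                    tts_name=name,
                    answer=ex.answer,
                )
            )
    return spoken


if __name__ == "__main__":
    data = [VQAExample("img_001", "What color is the bus?", "red")]
    models = {"tts_a": placeholder_tts, "tts_b": placeholder_tts}
    for item in synthesize_dataset(data, models):
        print(item.tts_name, len(item.question_audio))
```

Keeping the text question alongside the audio makes it straightforward to train or evaluate a text-only baseline on the same records, in the spirit of the upper-bound comparison the abstract mentions.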

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems