SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Tan-Hanh Pham, Hoang-Nam Le, Phu-Vinh Nguyen, Chris Ngo, and Truong-Son Hy

arXiv:2412.16771·cs.CV·December 24, 2024

Open Access · 1 Repo

TL;DR

SilVar is an end-to-end multimodal model that takes spoken instructions for visual reasoning tasks, reporting state-of-the-art results and enabling more natural human-machine interaction.

Contribution

We introduce SilVar, the first end-to-end model using speech instructions for visual question answering and reasoning, along with a new dataset for speech-based reasoning tasks.

Findings

01. SilVar achieves state-of-the-art results on the MMMU and ScienceQA benchmarks.

02. The model handles both speech-based reasoning and object localization.

03. Speech instructions make interaction more natural and support the model's reasoning.

Abstract

Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, which limits their effectiveness in human-machine interaction. Moreover, the quality of language models depends on reasoning and prompting techniques, such as chain-of-thought (CoT), which remain underexplored with speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques at several levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge…
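
The abstract names the three building blocks (CLIP, Whisper, LLaMA 3.1-8B) but the excerpt above does not say how they are connected. As a rough illustration only, the sketch below shows one common pattern for such models: features from each encoder are passed through a linear projection into the language model's embedding width and then concatenated into a single input sequence. Everything here is an assumption made for illustration, not SilVar's actual design: the dimensions, the random weights standing in for learned layers, the vision-then-audio ordering, and the helper names `make_projection`, `project`, and `fuse`.

```python
import random

# Hypothetical placeholder sizes, NOT the real CLIP / Whisper / LLaMA widths.
AUDIO_DIM, VISION_DIM, LLM_DIM = 6, 8, 4


def make_projection(in_dim, out_dim, seed):
    """Random (in_dim x out_dim) matrix standing in for a learned linear layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map each feature vector of length in_dim to a vector of length out_dim."""
    out_dim = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def fuse(audio_feats, vision_feats, w_audio, w_vision):
    """Project both modalities to the LLM width and concatenate along the sequence axis.

    The vision-first ordering is an arbitrary choice for this sketch.
    """
    return project(vision_feats, w_vision) + project(audio_feats, w_audio)


# Stand-ins for encoder outputs: 3 audio frames and 5 image patches.
audio_feats = [[0.1 * (t + k) for k in range(AUDIO_DIM)] for t in range(3)]
vision_feats = [[0.05 * (p - k) for k in range(VISION_DIM)] for p in range(5)]

w_audio = make_projection(AUDIO_DIM, LLM_DIM, seed=0)
w_vision = make_projection(VISION_DIM, LLM_DIM, seed=1)

# One sequence of 8 "tokens", each of width LLM_DIM, ready for a language model.
sequence = fuse(audio_feats, vision_feats, w_audio, w_vision)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is only the shape bookkeeping: both modalities end up as vectors of the same width, so a single language model can attend over them jointly. For the real wiring, see the linked repository.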

Code & Models

Repositories

hanhpt23/silvar
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Robotics and Automated Systems

Methods: Contrastive Language-Image Pre-training · LLaMA