SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham, Hoang-Nam Le, Phu-Vinh Nguyen, Chris Ngo, and Truong-Son Hy

TL;DR
SilVar is an innovative multimodal model that processes speech instructions for visual reasoning tasks, achieving state-of-the-art results and enabling more natural human-machine interactions.
Contribution
We introduce SilVar, the first end-to-end model using speech instructions for visual question answering and reasoning, along with a new dataset for speech-based reasoning tasks.
Findings
SilVar achieves SOTA on MMMU and ScienceQA benchmarks.
The model effectively handles speech-based reasoning and object localization.
Speech instructions improve interaction naturalness and reasoning capabilities.
Abstract
Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as chain-of-thought (CoT), which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques at several levels of speech instruction: conversational, simple, and complex. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge…
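The abstract names the three components (CLIP, Whisper, LLaMA 3.1-8B) but does not say how their outputs are combined. A common pattern in multimodal models is to project each encoder's features into the language model's embedding space and concatenate them into one input sequence. The sketch below illustrates that pattern only, on the assumption that it is broadly representative. The function names (`project`, `fuse`), the toy dimensions, and the weight values are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: toy stand-ins for encoder outputs, NOT the SilVar implementation.
from typing import List

Matrix = List[List[float]]


def project(features: Matrix, weights: Matrix) -> Matrix:
    """Linearly map each feature vector (row) into the language model's embedding space."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]


def fuse(audio_feats: Matrix, image_feats: Matrix,
         audio_proj: Matrix, image_proj: Matrix) -> Matrix:
    """Concatenate projected image tokens and projected speech tokens into one sequence."""
    return project(image_feats, image_proj) + project(audio_feats, audio_proj)


# Hypothetical toy sizes: 3-d audio features, 2-d image features, 4-d LM embeddings.
audio_feats = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]   # stand-in for 2 speech-encoder frames
image_feats = [[1.0, 1.0]]                          # stand-in for 1 vision-encoder patch
audio_proj = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]                 # 3 x 4 projection (assumed)
image_proj = [[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]]                 # 2 x 4 projection (assumed)

# The fused sequence is what a language model would consume as input embeddings.
sequence = fuse(audio_feats, image_feats, audio_proj, image_proj)
print(len(sequence), "tokens of dimension", len(sequence[0]))
```

In a real system the projections would be learned and the feature matrices would come from the pretrained encoders; the ordering of image and speech tokens here is likewise an arbitrary choice for illustration.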
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Robotics and Automated Systems
Methods: Contrastive Language-Image Pre-training · LLaMA
