SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization