Exploring The Visual Feature Space for Multimodal Neural Decoding
Weihao Xia, Cengiz Oztireli

TL;DR
This paper introduces a zero-shot multimodal brain decoding method that leverages pre-trained visual features within Multimodal Large Language Models to improve the granularity and accuracy of neural decoding of visual information.
Contribution
It analyzes different visual feature spaces and proposes a new benchmark for evaluating fine-grained neural decoding across multiple levels of detail.
Findings
Enhanced decoding precision with multimodal models
Effective zero-shot decoding of detailed visual descriptions
Introduction of the MG-BrainDub benchmark for evaluation
Abstract
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Digital Media Forensic Detection
Methods: ALIGN
