Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning
Weihang You, Hanqi Jiang, Yi Pan, Junhao Chen, Tianming Liu, Fei Dou

TL;DR
This paper introduces NeuroAlign, a brain-inspired hierarchical framework that improves fine-grained cross-modal understanding of neural responses to visual stimuli by modeling hierarchical visual processing and temporal dynamics.
Contribution
NeuroAlign is the first framework to incorporate hierarchical and temporal aspects of visual processing inspired by the human brain for neural-video alignment.
Findings
Outperforms existing methods in cross-modal retrieval tasks.
Effectively models temporal dynamics in neural responses.
Establishes a new paradigm for understanding visual cognitive mechanisms.
Abstract
Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal processes of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while…
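The two stages named in the abstract can be illustrated with generic building blocks. The sketch below is not the authors' NTCL objective or their enhanced vector quantization, neither of which is specified in this summary. It assumes a standard symmetric (fMRI-to-video and video-to-fMRI) InfoNCE-style contrastive loss as a stand-in for global alignment, and a plain nearest-neighbour codebook lookup as a stand-in for fine-grained pattern matching. All function names, the temperature value, and the array shapes are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(fmri_emb, video_emb, temperature=0.07):
    # Assumption: row i of fmri_emb and row i of video_emb form a matched pair.
    f = l2_normalize(fmri_emb)
    v = l2_normalize(video_emb)
    logits = f @ v.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(f))
    loss_f2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # fMRI -> video
    loss_v2f = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> fMRI
    return 0.5 * (loss_f2v + loss_v2f)

def quantize(features, codebook):
    # Plain vector quantization: map each feature to its nearest codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]

# Toy data (synthetic, for illustration only).
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
fmri_aligned = video + 0.01 * rng.normal(size=(8, 16))  # nearly matched pairs
fmri_random = rng.normal(size=(8, 16))                  # unrelated embeddings

loss_aligned = symmetric_contrastive_loss(fmri_aligned, video)
loss_random = symmetric_contrastive_loss(fmri_random, video)

codebook = rng.normal(size=(4, 16))
codes, quantized = quantize(fmri_aligned, codebook)
```

On this toy data the loss for the nearly matched pairs is lower than for the unrelated embeddings, which is the behaviour a contrastive alignment objective is meant to reward. The codebook lookup returns one discrete index per feature vector along with the selected codebook entry.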
Taxonomy
Topics: Face Recognition and Perception · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
