Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models
Rining Wu, Feixiang Zhou, Ziwei Yin, Jian K. Liu

TL;DR
This paper introduces Vi-ST, a spatiotemporal neural network model that aligns brain neuronal responses with dynamic visual scenes, advancing understanding of temporal visual coding in the brain.
Contribution
It presents Vi-ST, a novel model that combines a self-supervised Vision Transformer with spatiotemporal modules to predict neuronal responses to natural videos.
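To make the general idea concrete, the following is a minimal NumPy sketch of a pipeline of this kind: per-frame features, a temporal module, and a readout to neuronal firing rates. It is not the authors' Vi-ST implementation. Every function name, shape, and parameter here is a hypothetical choice for illustration: a fixed random projection stands in for the frozen self-supervised Vision Transformer, a causal per-channel temporal filter stands in for the spatiotemporal modules, and a softplus linear readout is assumed for the output stage.

```python
import numpy as np

def frame_features(video, proj):
    """Stand-in for a frozen pretrained image encoder (assumption: a fixed
    random projection replaces the self-supervised ViT for illustration)."""
    n_frames = video.shape[0]
    return video.reshape(n_frames, -1) @ proj  # (T, D)

def temporal_filter(feats, kernel):
    """Causal temporal filter per feature channel.
    out[t] = sum_k kernel[k] * feats[t - k], so kernel[0] weights the current frame."""
    n_taps, n_frames = kernel.shape[0], feats.shape[0]
    # Zero-pad the past so the first frames see an empty history.
    padded = np.concatenate([np.zeros((n_taps - 1, feats.shape[1])), feats], axis=0)
    out = np.zeros_like(feats)
    for k in range(n_taps):
        start = n_taps - 1 - k
        out += kernel[k] * padded[start:start + n_frames]
    return out

def readout(feats, weights, bias):
    """Linear readout followed by softplus, which keeps predicted rates positive."""
    return np.logaddexp(0.0, feats @ weights + bias)  # (T, N)

def predict(video, proj, kernel, weights, bias):
    """Video (T, H, W) -> predicted firing rates (T, N)."""
    return readout(temporal_filter(frame_features(video, proj), kernel), weights, bias)

# Toy usage with random data (all sizes are arbitrary illustrative values).
rng = np.random.default_rng(0)
T, H, W, D, K, N = 20, 8, 8, 16, 5, 3   # frames, height, width, features, taps, neurons
video = rng.standard_normal((T, H, W))
proj = rng.standard_normal((H * W, D)) / np.sqrt(H * W)
kernel = rng.standard_normal((K, D)) / K
weights = rng.standard_normal((D, N)) / np.sqrt(D)
bias = np.zeros(N)

rates = predict(video, proj, kernel, weights, bias)
print(rates.shape)
```

In this sketch the temporal filter is causal, so the predicted response at a given frame depends only on that frame and the preceding K - 1 frames. In a trained model the filter and readout parameters would be fitted to recorded responses rather than drawn at random.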
Findings
Vi-ST achieves robust generalization in predicting neuronal responses.
Ablation studies highlight the importance of each temporal module.
The paper's new evaluation metric effectively assesses temporal aspects of visual coding.
Abstract
Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are embedded in the neuronal responses of the retina. It is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved the way for a more advanced understanding of image pixels. Yet a deep understanding of how neuronal coding in the brain aligns with pixels is still largely lacking. Most previous studies employ static images, or artificial videos derived from static images, to emulate more realistic and complicated stimuli. Although these simple scenarios effectively help to separate key factors influencing visual coding, they leave complex temporal relationships out of consideration. To decompose the temporal features of visual coding in natural scenes, here we propose…
Taxonomy
Topics: Advanced Vision and Imaging · Visual perception and processing mechanisms
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Vision Transformer
