Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Rining Wu; Feixiang Zhou; Ziwei Yin; Jian K. Liu

arXiv:2407.10737·cs.CV·July 16, 2024

TL;DR

This paper introduces Vi-ST, a spatiotemporal neural network model that aligns brain neuronal responses with dynamic visual scenes, advancing understanding of temporal visual coding in the brain.

Contribution

The paper presents Vi-ST, a novel model that combines a self-supervised Vision Transformer with spatiotemporal modules to decode neuronal responses to natural videos.
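To make the data flow concrete, here is a minimal sketch of the general pattern: per-frame image features, a temporal module that mixes them across time, and a readout to per-neuron responses. It is not the Vi-ST implementation. The function names `temporal_mix` and `readout`, the causal filter, and the rectified linear readout are all assumptions chosen for illustration, and the frame features are taken as given rather than computed by a Vision Transformer.

```python
from typing import List

Vector = List[float]


def temporal_mix(frame_features: List[Vector], kernel: List[float]) -> List[Vector]:
    """Hypothetical temporal module: a causal filter over frame features.

    Each output step is a weighted sum of the current frame's features and
    those of the preceding frames (kernel[0] weights the current frame,
    kernel[1] the previous one, and so on). Frames before the start of the
    clip are treated as zeros.
    """
    mixed: List[Vector] = []
    for t in range(len(frame_features)):
        acc = [0.0] * len(frame_features[t])
        for lag, weight in enumerate(kernel):
            if t - lag >= 0:
                past = frame_features[t - lag]
                for d in range(len(acc)):
                    acc[d] += weight * past[d]
        mixed.append(acc)
    return mixed


def readout(mixed: List[Vector], weights: List[Vector], bias: Vector) -> List[Vector]:
    """Hypothetical readout: one linear unit per neuron, rectified so the
    predicted response (e.g. a firing rate) is never negative."""
    responses: List[Vector] = []
    for feat in mixed:
        step = []
        for w_n, b_n in zip(weights, bias):
            drive = sum(w * x for w, x in zip(w_n, feat)) + b_n
            step.append(max(0.0, drive))
        responses.append(step)
    return responses


# Toy example: 3 frames, 2 feature dimensions, 2 neurons.
# In a real pipeline the features would come from a pretrained image encoder.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = temporal_mix(features, kernel=[0.5, 0.5])
predicted = readout(mixed, weights=[[1.0, 1.0], [1.0, -2.0]], bias=[0.0, 0.0])
```

In this toy setup the image features stay fixed and only the temporal filter and readout would be fitted to recordings. Whether Vi-ST freezes its backbone is not stated in this summary.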

Findings

01

Vi-ST achieves robust generalization in predicting neuronal responses.

02

Ablation studies highlight the importance of each temporal module.

03

The new metric effectively evaluates temporal aspects of visual coding.
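This summary does not define the new metric. As a point of reference only, a common baseline for scoring a predicted response trace against a recorded one is the Pearson correlation coefficient. The sketch below implements that standard baseline, not the paper's metric. The function name `pearson_cc` and the choice to return 0.0 for a constant trace are assumptions.

```python
import math
from typing import Sequence


def pearson_cc(predicted: Sequence[float], recorded: Sequence[float]) -> float:
    """Pearson correlation between a predicted and a recorded response trace."""
    if len(predicted) != len(recorded):
        raise ValueError("traces must have the same number of time bins")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two time bins")

    mean_p = sum(predicted) / n
    mean_r = sum(recorded) / n

    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, recorded))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in recorded)

    # Correlation is undefined for a constant trace; returning 0.0 here is
    # a convention chosen for this sketch, not a mathematical result.
    if var_p == 0.0 or var_r == 0.0:
        return 0.0
    return cov / math.sqrt(var_p * var_r)
```

A score computed this way compares the two traces bin by bin and is unchanged by rescaling or offsetting either trace, so it says little about timing errors on its own.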

Abstract

Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are captured in the neuronal responses of the retina, so it is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved an advanced way of understanding image pixels, yet how neuronal coding in the brain aligns with pixels remains poorly understood. Most previous studies employ static images, or artificial videos derived from static images, to emulate more realistic and complicated stimuli. Although these simple scenarios effectively help to separate key factors influencing visual coding, they leave complex temporal relationships out of consideration. To decompose the temporal features of visual coding in natural scenes, here we propose…

Code & Models

Repositories

wurining/Vi-ST (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Visual perception and processing mechanisms

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Vision Transformer