BIMM: Brain Inspired Masked Modeling for Video Representation Learning

Zhifan Wan; Jie Zhang; Changzhen Li; Shiguang Shan

arXiv:2405.12757·cs.CV·May 22, 2024

TL;DR

BIMM is a brain-inspired framework that learns comprehensive video representations through dual-branch masked modeling with partially shared parameters, modeled on the ventral and dorsal pathways of the human visual system.

Contribution

This work introduces a novel dual-branch masked modeling approach with partial parameter sharing, inspired by human visual pathways, for improved video representation learning.

Findings

01. Outperforms state-of-the-art methods on video tasks

02. Effective in capturing both object and motion information

03. Demonstrates the benefit of brain-inspired architecture

Abstract

The visual pathway of the human brain includes two sub-pathways, i.e., the ventral pathway and the dorsal pathway, which focus on object identification and dynamic information modeling, respectively. Both pathways comprise multi-layer structures, with each layer responsible for processing different aspects of visual information. Inspired by the visual information processing mechanism of the human brain, we propose the Brain Inspired Masked Modeling (BIMM) framework, which aims to learn comprehensive representations from videos. Specifically, our approach consists of ventral and dorsal branches, which learn image and video representations, respectively. Both branches employ the Vision Transformer (ViT) as their backbone and are trained using masked modeling. To achieve the goals of different visual cortices in the brain, we segment the encoder of each branch into three intermediate blocks and…
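The two mechanical ingredients named in the abstract, masking input patches and segmenting an encoder into three blocks, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_mask` and `split_into_blocks`, uniform random masking, the near-equal contiguous split, and the example settings (196 patches, 75% mask ratio, 12 layers) are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists for masked modeling.

    Assumption: patches are masked uniformly at random.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def split_into_blocks(num_layers, num_blocks=3):
    """Partition encoder layer indices into contiguous, near-equal blocks.

    Assumption: an even split; the abstract only says "three blocks".
    """
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


# Hypothetical settings: a 14x14 patch grid, 75% masking, a 12-layer encoder.
visible, masked = random_mask(num_patches=196, mask_ratio=0.75, seed=0)
blocks = split_into_blocks(num_layers=12)
print(len(visible), len(masked))  # 49 147
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

In a masked modeling setup of this kind, only the visible patches would be fed to the encoder and the masked ones would serve as reconstruction targets. Which blocks the ventral and dorsal branches share is not stated in the text above, so the sketch leaves parameter sharing out.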

Code & Models

Repositories

tonyalbertwan/bimm (Official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout · Softmax · Focus · Multi-Head Attention