TL;DR
BEAST is a scalable self-supervised transformer-based framework that enhances neuro-behavioral analysis from videos, reducing reliance on labeled data and improving performance across multiple tasks and species.
Contribution
It introduces BEAST, a novel pretraining approach combining masked autoencoding and contrastive learning for versatile, experiment-specific behavioral video analysis.
Findings
Improved correlation between behavioral features and neural activity.
Enhanced pose estimation accuracy.
Effective action segmentation in multi-animal videos.
Abstract
The brain can only be fully understood through the lens of the behavior it generates -- a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance in three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity,…
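The abstract's pairing of masked autoencoding with temporal contrastive learning can be illustrated with a toy objective. The sketch below is not the authors' implementation: it assumes a reconstruction error computed only on hidden patches, an InfoNCE-style contrastive term in which each anchor's positive is taken to be the embedding of a temporally nearby frame, and a simple weighted sum of the two. All function names, the 0.75 mask ratio, the temperature, and the weight are hypothetical choices made for illustration.

```python
import math
import random


def random_mask(num_patches, mask_ratio=0.75, rng=random):
    """Pick which patches to hide; True marks a hidden patch (ratio is an assumption)."""
    k = max(1, round(num_patches * mask_ratio))
    hidden = set(rng.sample(range(num_patches), k))
    return [i in hidden for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error averaged over hidden patches only."""
    errs = [
        sum((p - t) ** 2 for p, t in zip(pred[i], target[i])) / len(target[i])
        for i in range(len(target))
        if mask[i]
    ]
    return sum(errs) / len(errs)


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: positives[i] pairs with anchors[i]; other rows serve as negatives."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def combined_loss(pred, target, mask, anchors, positives, weight=1.0):
    """Hypothetical weighted sum of the reconstruction and contrastive terms."""
    return masked_mse(pred, target, mask) + weight * info_nce(anchors, positives)
```

In this toy setup, `pred` and `target` are lists of per-patch pixel vectors for one frame, while `anchors` and `positives` are lists of per-frame embeddings. How frames are sampled and how the two terms are balanced in BEAST itself is not specified in the text above.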
Peer Reviews
Decision·ICLR 2026 Poster
- Ablations and comparisons are very extensive and clearly outlined
- Paper is very well written. Dense, but very informative and to the point.
- The figures have high quality and are far above the average of ICLR papers
- The model has convincing performance across the different tasks.
While the method presented seems powerful for a variety of tasks, the evaluation as-is is currently too weak; it seems like BEAST is largely building on existing video-pre-training schemes. This in itself might be fine, but from e.g. Table 1 and 4 it seems that even frozen backbone models are suitable for solving the neural encoding tasks. The authors need to better delineate what their methodological contribution is, and what it adds over a strong baseline model. I would e.g. consider doing a
1. Cleverly tailors self-supervised pretraining to behavioral videos (static background, movement-driven variation) via selection/sampling strategies.
2. Evaluates frozen-backbone features on segmentation and zero-/few-shot neural encoding, with systematic ablations.
3. Clear writing.
4. On action segmentation, BEAST bypasses pose-estimation training while matching or exceeding keypoint-based systems, reducing months-long annotation pipelines.
1. Cross-domain generalization. The approach emphasizes single-paradigm pretraining; comprehensive tests across cameras/environments/species would strengthen claims of generality.
2. Efficiency. The paper positions BEAST as more efficient than native video models but does not report controlled FLOPs/memory/runtime.
3. Interpretability. Beyond showing CLS superiority for pretraining, adding feature visualizations or sensitivity analyses linking learned features to anatomical/behavioral elements
Comprehensive Evaluation: The paper thoroughly evaluates BEAST across multiple tasks (neural encoding, pose estimation, action segmentation) and datasets (mice, fish), demonstrating its versatility and robustness. Neural Encoding Focus: The inclusion of neural encoding as a downstream task is a significant and compelling contribution, linking behavioral video analysis directly to neural activity—a fundamental goal in neuroscience. Efficient Use of Unlabeled Data: BEAST effectively leverages un
1. Limited Discussion of Related Self-Supervised Methods: While the paper discusses general self-supervised learning (SSL) methods like MAE and contrastive learning, it does not adequately address existing SSL approaches specifically designed for animal behavior. For example: ContrastivePose: A contrastive learning approach for self-supervised feature engineering for pose estimation and behavorial classification of interacting animals. Tianxun Zhou, Calvin Chee Hoe Cheah, Eunice Wei
Taxonomy
Methods: Contrastive Learning
