Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4
William Berrios, Arturo Deza

TL;DR
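This paper shows that a dual-stream Transformer (CrossViT), trained jointly for rotational invariance and adversarial robustness, placed 2nd in the aggregate Brain-Score 2022 competition and, at the time of the competition, held 1st place for explainable variance in area V4. It also explains more variance for V4, IT and Behaviour than a ResNet50 with a V1-like front-end module.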
Contribution
It introduces a joint rotationally-invariant and adversarial optimization procedure for a dual-stream Vision Transformer that improves its alignment with visual brain responses, reaching a state-of-the-art Brain-Score for area V4.
Findings
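Achieved 2nd place in the aggregate Brain-Score 2022 competition, and 1st place for explainable variance of area V4 at the time of the competition
Outperformed a biologically-inspired ResNet50 with a V1-like front-end module in explainable variance for V4, IT, and Behaviour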
Joint optimization enhances model robustness and interpretability
Abstract
Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper, we provide evidence against the unexpected trend of Vision Transformers (ViT) not being perceptually aligned with human visual representations by showing how a dual-stream Transformer, a CrossViT (Chen et al., 2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our current Transformer-based model also achieves greater explainable variance for areas V4, IT and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et…
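To make the phrase "joint rotationally-invariant and adversarial optimization" concrete, the following is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: a logistic-regression classifier stands in for the CrossViT, rotations are restricted to quarter turns, the attack is a single FGSM step, and the helper names (`rotate90`, `fgsm`, `train`, `make_toy_data`, `accuracy`) are hypothetical. It only shows the general pattern of rotating each input at random and then updating the model on an adversarially perturbed copy of it.

```python
import math
import random


def rotate90(img, k):
    """Rotate a square image (list of rows) by k quarter turns."""
    n = len(img)
    for _ in range(k % 4):
        img = [[img[j][n - 1 - i] for j in range(n)] for i in range(n)]
    return img


def flatten(img):
    return [p for row in img for p in row]


def predict(w, b, x):
    """Probability of class 1 under a logistic model (stand-in for the network)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """One-step attack: shift each pixel by eps along the sign of dLoss/dx."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]  # cross-entropy gradient w.r.t. the input
    return [
        min(1.0, max(0.0, xi + eps * ((g > 0) - (g < 0))))  # keep pixels in [0, 1]
        for xi, g in zip(x, grad)
    ]


def train(data, size, epochs=100, lr=0.1, eps=0.1, seed=0):
    """SGD where every sample is randomly rotated, then adversarially perturbed."""
    rng = random.Random(seed)
    w, b = [0.0] * (size * size), 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for img, y in data:
            x = flatten(rotate90(img, rng.randrange(4)))  # rotation augmentation
            x_adv = fgsm(w, b, x, y, eps)                 # adversarial example
            err = predict(w, b, x_adv) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
            b -= lr * err
    return w, b


def make_toy_data(n_per_class, size, seed=1):
    """Hypothetical data: class 0 images are dark, class 1 images are bright."""
    rng = random.Random(seed)
    data = []
    for y, (lo, hi) in ((0, (0.0, 0.4)), (1, (0.6, 1.0))):
        for _ in range(n_per_class):
            img = [[rng.uniform(lo, hi) for _ in range(size)] for _ in range(size)]
            data.append((img, y))
    return data


def accuracy(w, b, data):
    hits = sum((predict(w, b, flatten(img)) > 0.5) == bool(y) for img, y in data)
    return hits / len(data)
```

A usage example under the same assumptions: `data = make_toy_data(20, 4)` followed by `w, b = train(data, 4)` fits the toy model, and `accuracy(w, b, data)` reports how often it classifies the clean, unrotated images correctly. In a realistic setting the linear model would be replaced by the network, the quarter turns by arbitrary-angle rotations, and the single FGSM step by whichever attack the training recipe specifies.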
Taxonomy
Topics: Face Recognition and Perception · EEG and Brain-Computer Interfaces · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Concatenated Skip Connection · CrossViT · Dropout · Dense Connections · Residual Connection · Layer Normalization
