A foundation model of vision, audition, and language for in-silico neuroscience
Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brooks, Katelyn Begany, Joséphine Raugel, Hubert Banville, Jean-Rémi King

TL;DR
TRIBE v2 is a tri-modal foundation model that predicts human brain activity across various stimuli and tasks, unifying cognitive neuroscience with AI and enabling in silico experiments.
Contribution
The paper introduces TRIBE v2, a tri-modal foundation model that integrates vision, audition, and language to predict human brain responses. It outperforms traditional linear encoding models and reveals neural topography.
Findings
Accurately predicts brain responses for new stimuli and subjects.
Outperforms traditional linear encoding models, with several-fold improvements in accuracy.
Enables in silico experiments reproducing empirical neuroscience results.
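For readers unfamiliar with the baseline named in the findings above, the following is a minimal sketch of what a linear encoding model typically looks like: a ridge regression from stimulus features to per-voxel responses, scored by the correlation between predicted and held-out activity. It is not the paper's code. The data are synthetic, and the function names (`fit_ridge`, `voxelwise_pearson`), the array sizes, and the regularization strength are all assumptions chosen for illustration.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted time courses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-in data (hypothetical sizes, not from the paper):
# rows are time points, X columns are stimulus features, Y columns are voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 20, 50
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + rng.normal(size=(n_test, n_voxels))

# Fit on the training split, then score predictions on the held-out split.
W_hat = fit_ridge(X_train, Y_train, alpha=10.0)
scores = voxelwise_pearson(Y_test, X_test @ W_hat)
print(f"mean held-out correlation: {scores.mean():.3f}")
```

In practice the features would come from a pretrained network applied to the stimulus, and the regularization strength would be chosen by cross-validation. A fixed linear map like this one is the kind of baseline the findings compare against.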
Abstract
Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, superseding traditional linear encoding models, delivering several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent…
