TL;DR
TRIBE is a deep neural network that integrates text, audio, and video representations to predict whole-brain fMRI responses across multiple modalities and cortical areas, advancing towards a unified model of cognition.
Contribution
It introduces the first multimodal brain response prediction model trained on diverse stimuli, combining pretrained foundational models with a transformer to handle spatial and temporal dynamics.
Findings
Achieved first place in the Algonauts 2025 brain encoding competition.
Showed that the multimodal model outperforms unimodal models in high-level associative cortices.
Demonstrated precise modeling of spatial and temporal fMRI responses to videos.
Abstract
Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are…
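As a rough illustration of the fusion step described in the abstract, the sketch below concatenates text, audio and video feature vectors for one time step and applies modality dropout, a mechanism mentioned in one of the reviews. It is not the authors' implementation: the function name `fuse_with_modality_dropout`, the probability `p_drop`, the zero-fill strategy, and the toy feature sizes are all assumptions made for illustration.

```python
import random


def fuse_with_modality_dropout(features, p_drop=0.3, rng=None, training=True):
    """Concatenate per-modality feature vectors for a single time step.

    features: dict mapping a modality name to a list of floats.
    During training, each modality is zeroed out with probability p_drop,
    but at least one modality is always kept (an assumed safeguard).
    Returns the fused vector and the set of dropped modality names.
    """
    if rng is None:
        rng = random.Random()
    names = sorted(features)  # fixed order keeps the fused layout stable
    dropped = set()
    if training:
        dropped = {name for name in names if rng.random() < p_drop}
        if len(dropped) == len(names):
            # Never drop every modality: restore one at random.
            dropped.discard(rng.choice(names))
    fused = []
    for name in names:
        vec = features[name]
        # A dropped modality contributes zeros, so the output size is constant.
        fused.extend([0.0] * len(vec) if name in dropped else vec)
    return fused, dropped


# Hypothetical toy features for one time step (real embeddings are far larger).
step = {"text": [0.1, 0.2], "audio": [0.3], "video": [0.4, 0.5, 0.6]}
fused, dropped = fuse_with_modality_dropout(step, p_drop=0.5, rng=random.Random(0))
```

In a full model, a sequence of such fused vectors would be passed to the transformer. Zero-filling rather than removing a dropped modality keeps the input dimension fixed, which is one simple way to train a model that tolerates a missing input stream.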
Peer Reviews
Decision: ICLR 2026 Poster
1. Combines text, audio, and video features for brain prediction, which covers more brain regions than single-modality models.
2. Achieves clear state-of-the-art results on a public benchmark, outperforming all other competitors by a noticeable margin.
3. The model design is practical and scalable, using existing large models and a transformer to handle real-world, naturalistic data.
1. The model relies on combining a large number of models (ensemble), so it’s unclear how well a single model works in practice.
2. Some details in the paper are inconsistent, for example the feature dimensions in Figure 2 don’t match the numbers in the methods section, and the number of teams is sometimes 262 and sometimes 263.
1. The motivation of this article is very good, as it analyzes whole-brain information from a multimodal perspective. It is of great practical significance and better aligns with data processing procedures in the era of large models. It is therefore also very beneficial for research on more general brain foundation data.
2. The experiments are well-conducted, and the performance is also good. Its 1st-place ranking out of 263 teams in a competitive benchmark is the strongest and most o
1. The article does not disclose or discuss the complexity of the model. To the best of my knowledge, many previous brain decoding projects employed relatively small models. The current method, however, employs multiple large pre-trained models, so the paper should report the overall model size so that others can evaluate and use it.
2. Starting from line 48, the first two motivations actually involve many models that no longer use simple regression. Moreover, numerous recent studies have focu
1) The model’s architecture is not a brute-force fusion but a theoretically motivated hierarchy: frozen modality experts (V-JEPA2, Wav2Vec2-BERT, LLaMA3.2) adapted into a shared latent space, followed by a transformer capturing temporal and intersubject alignment. The multisubject encoder and modality dropout mechanisms seem particularly well thought out, improving both biological plausibility and statistical efficiency.
2) The multimodal integration analysis is both methodologically clear and s
1) V-JEPA activations are averaged over patches to manage compute, and the authors themselves expect degradation in low-level, retinotopic cortex. That choice complicates interpreting modality-dominance maps in early vision. The paper would be stronger if the authors could provide some clarity here.
2) The normalized Pearson metric (Eq. 1, claiming TRIBE captures "54% of explainable variance") relies on test-retest reliability (ρself) computed from only two repeated movies: Hidden Figures an
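To make the concern about the normalized Pearson metric concrete, here is a minimal sketch of a noise-ceiling-normalized score. The paper's Eq. 1 is not reproduced on this page, so the form used here (the model's mean correlation with two repeated recordings, divided by the test-retest correlation ρself) is an assumption, and the names `pearson` and `normalized_pearson` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant signal
    return cov / math.sqrt(var_x * var_y)


def normalized_pearson(pred, run_a, run_b):
    """Assumed form: mean model-to-run correlation divided by rho_self.

    run_a and run_b are two recordings of the same stimulus for one parcel.
    Returns NaN when rho_self is not positive, since the ratio is then
    meaningless.
    """
    rho_self = pearson(run_a, run_b)  # test-retest reliability
    if not rho_self > 0:
        return float("nan")
    r_model = 0.5 * (pearson(pred, run_a) + pearson(pred, run_b))
    return r_model / rho_self


# Hypothetical toy time series for a single parcel.
run_a = [1.0, 2.0, 3.0, 4.0]
run_b = [1.0, 2.0, 3.0, 5.0]
pred = [1.1, 1.9, 3.2, 4.4]
score = normalized_pearson(pred, run_a, run_b)
```

Because ρself sits in the denominator, a noisy reliability estimate from a small number of repeated stimuli propagates directly into the normalized score, which is the sensitivity the reviewer points to.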
