EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

Nie Lin; Yansen Wang; Dongqi Han; Weibang Jiang; Jingyuan Li; Ryosuke Furuta; Yoichi Sato; Dongsheng Li

arXiv:2506.01353·cs.AI·October 15, 2025

Open Access · 1 Dataset · 3 Reviews

TL;DR

EgoBrain introduces a large-scale multimodal dataset of synchronized EEG and first-person video, together with a fusion framework for human action understanding that reaches 66.70% action recognition accuracy.

Contribution

The paper presents the first extensive synchronized EEG and egocentric video dataset and a multimodal learning framework for improved human action recognition.

Findings

01

Achieved 66.70% action recognition accuracy.

02

Provided a large, openly shared multimodal dataset.

03

Validated framework across subjects and environments.

Abstract

The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models has brought new possibilities that had never been imagined before. Here, we present EgoBrain, the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of the human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a multimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper contributes a new dataset, EgoBrain, which provides synchronized video and EEG signals and can be valuable for computer vision research.
2. The paper shows that EEG signals can be a useful modality for tasks such as action recognition when the visual modality is occluded.
3. The paper reports both cross-subject and joint cross-subject and cross-scene evaluations, which form a challenging benchmark for model generalization.

Weaknesses

1. The Brain-TIM architecture seems incremental compared to TIM [1]. The architecture presented in the paper, consisting of modality-specific encoders, embedding layers, a Time-Interval MLP, and a Transformer encoder, seems to be a direct application of the existing TIM framework to a new pair of modalities. Can the authors clarify the differences between TIM and Brain-TIM? Is Brain-TIM just an extension of TIM to multiple modalities?
2. While the idea of using EEG signal to understand ego

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This review evaluates the paper's quality based on the following criteria: task relevance, related work, technical novelty, technical correctness, experimental validation, writing and presentation, and reproducibility. Each aspect is discussed and highlighted as a strength or a weakness in the sections below.
- **Dataset Contribution and Reproducibility:** This paper contributes to the community a dataset of synchronized egocentric videos and 32-channel encephalography recordings. However, it

Weaknesses

- **Relevance of the task and Experimental Validation:** Although action classification from egocentric videos and encephalography recordings may be a relevant problem for the ICLR community, the motivation for including this novel data modality is not well stated in the paper's introduction. The paper already reports high performance on the proposed task, so the benchmark may saturate quickly. Considering these results, what are the reasons to keep the data acquisition as simple as p

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. First-of-its-kind dataset integrating real-world egocentric vision with EEG; extensive and ethically curated.
2. Clear methodological design with well-justified architecture choices (temporal embeddings, modality-aware tokens).
3. Insightful qualitative results showing when EEG signals complement vision (e.g., occlusion or intent disambiguation).

Weaknesses

1. The synchronization precision is stated as <1 s jitter. This is relatively large for fast-changing neural signals and short actions, and could limit the precise time-locking needed to analyze rapid neural correlates of action initiation or error. A discussion of how the Brain-TIM model's windowing strategy mitigates the impact of this 1 s jitter is needed.
2. The model’s novelty is limited; Brain-TIM primarily applies existing time-embedding concepts
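The jitter concern in point 1 can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a fixed-length analysis window and a constant timing offset between the EEG and video streams, and the function name `overlap_fraction` is hypothetical. Under those assumptions, the share of an EEG window that still covers the intended action interval is `max(0, 1 - |offset| / window)`.

```python
def overlap_fraction(window_s: float, offset_s: float) -> float:
    """Fraction of a window that still overlaps the intended interval
    when one stream is shifted by offset_s seconds relative to the other.

    Assumption (not from the paper): both intervals have the same
    length window_s and the shift is constant over the window.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return max(0.0, 1.0 - abs(offset_s) / window_s)


if __name__ == "__main__":
    # Worst case implied by a "<1 s" jitter bound: an offset of about 1 s.
    for window_s in (0.5, 1.0, 2.0, 5.0, 10.0):
        frac = overlap_fraction(window_s, offset_s=1.0)
        print(f"{window_s:5.1f} s window, 1 s offset: {frac:.2f} overlap")
```

Under these assumptions, a 1 s offset leaves no overlap for windows of 1 s or shorter, while a 10 s window keeps 90% of its content aligned, which is why the length of the model's windows matters for the reviewer's question.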

Code & Models

Datasets

ut-vision/EgoBrain
dataset · 133 dl

Taxonomy

Topics: Complex Systems and Decision Making