Autoregressive Visual Decoding from EEG Signals
Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye

TL;DR
This paper introduces AVDE, an efficient autoregressive framework that decodes visual images from EEG signals by aligning EEG and image representations and generating images hierarchically, outperforming previous methods with fewer parameters.
Contribution
The paper presents a novel autoregressive model for EEG-to-image decoding that leverages pre-trained models and a hierarchical prediction strategy, improving efficiency and interpretability.
Findings
AVDE outperforms state-of-the-art methods in both image retrieval and image reconstruction.
Uses only 10% of the parameters of previous models.
Generative process reflects hierarchical visual perception.
Abstract
Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on…
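The abstract describes aligning EEG and image representations through contrastive learning. As a rough illustration of what such an objective can look like, the sketch below implements a symmetric, CLIP-style InfoNCE loss in NumPy. The function names, the temperature value, and the choice of a symmetric loss are assumptions made for illustration; the summary does not specify the exact loss AVDE uses.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(eeg_emb, img_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of eeg_emb and row i of img_emb are assumed to come from the same
    trial (a positive pair); every other row in the batch acts as a negative.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    eeg = l2_normalize(eeg_emb)
    img = l2_normalize(img_emb)
    logits = eeg @ img.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])            # positives sit on the diagonal
    loss_eeg_to_img = -log_softmax(logits)[idx, idx].mean()
    loss_img_to_eeg = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_eeg_to_img + loss_img_to_eeg)


# Toy check with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
eeg_batch = rng.normal(size=(8, 16))
aligned_loss = symmetric_contrastive_loss(eeg_batch, eeg_batch)
mismatched_loss = symmetric_contrastive_loss(eeg_batch, rng.normal(size=(8, 16)))
```

In this toy setup the perfectly aligned batch yields a lower loss than the mismatched one, which is the behaviour fine-tuning would push the EEG encoder toward.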
Peer Reviews
Decision: ICLR 2026 Poster
+ This work thoughtfully addresses the challenge of computational overhead seen in models like diffusion, achieving remarkable results with just 10% of the parameters.
+ Moreover, the autoregressive approach shines with its simplicity and efficiency. By utilizing the EEG embedding as the initial token for the transformer, it establishes a clever architectural foundation.
+ This model outperforms more complex multi-stage and diffusion-based counterparts in both retrieval and reconstruction ta
- Autoregressive models can sometimes face the challenge of error accumulation, where an initial mistake in an early token or pixel patch can amplify throughout the rest of the generation. State how we can address this.
- While this approach offers great efficiency and performs well on standard metrics, we may notice that the qualitative fidelity of its reconstructions doesn’t always match that of larger, slower models, such as diffusion, in more complex scenes. It's all about striking a balanc
1. The combination of a large-scale pre-trained EEG encoder (LaBraM) and an autoregressive generative model (VAR) is a technically reasonable and promising direction. Pretraining-based decoding is becoming a general paradigm, and applying it here may help bridge the data scarcity of EEG.
2. VAR offers computational advantages over diffusion-based pipelines, including lower latency and reduced parameter count. The paper quantifies these gains and reports consistent, if moderate, improvements in r
1. The research motivation remains underdeveloped. While AVDE is presented as an alternative to diffusion models, it does not fully resolve the fundamental challenges of EEG-to-image decoding. The work should be viewed as an exploratory step rather than a definitive advance.
2. The methodological innovation is limited. The approach primarily substitutes diffusion with an autoregressive transformer and incorporates LaBraM pretraining. These are incremental modifications rather than new modeling i
1. **Next-scale prediction.** AVDE has an interesting idea (“next-scale prediction”, rather than traditional multi-stage diffusion models). It is conceptually novel. Unlike earlier methods that suffer from error propagation across multiple stages, AVDE generates visual content progressively, starting from coarse EEG embeddings and refining them to more detailed image representations. This approach improves coherence between EEG inputs and reconstructed images.
2. **Performance.** The proposed met
1. The paper’s strongest results are within-subject; cross-subject transfer remains limited and the manuscript offers little analysis of what factors drive between-subject variance.
Minor
2. The efficiency comparison fixes diffusion at 50+4 steps and specific CFG/top-k settings while the proposed AR approach uses 10 steps; modern diffusion samplers (e.g., DDIM, DPM-Solver/DPMS++) can operate at 10–20 steps with competitive quality, and lighter priors are possible. As written, the comparison ris
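The reviews above refer to next-scale prediction and to a 10-step autoregressive budget. In next-scale prediction each step emits a whole token map at one resolution, not a single token, so the step count equals the number of scales. The sketch below counts steps and tokens for a hypothetical coarse-to-fine schedule. The side lengths in `SCALES` and the helper names are assumptions for illustration only; this summary does not state the schedule AVDE uses.

```python
def tokens_per_scale(side_lengths):
    """Number of tokens in a square token map at each scale (side * side)."""
    return [side * side for side in side_lengths]


def summarize_schedule(side_lengths):
    """Report how many autoregressive steps and tokens a scale schedule implies."""
    counts = tokens_per_scale(side_lengths)
    return {
        "steps": len(side_lengths),      # one forward pass per scale
        "tokens_per_step": counts,       # all tokens of a scale are emitted together
        "total_tokens": sum(counts),
    }


# Hypothetical 10-scale schedule, from a 1x1 map up to a 16x16 map (assumption).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
summary = summarize_schedule(SCALES)
```

Under this assumed schedule, 680 tokens are produced in 10 passes, whereas a token-by-token decoder would need one pass per token. That difference is the kind of saving the efficiency discussion is about, independent of how the diffusion baseline is configured.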
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Emotion and Mood Recognition · Neural dynamics and brain function
