MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding
Wenwen Zeng, Yonghuang Wu, Yifan Chen, Xuan Xie, Chengqian Zhao, Feiyu Yin, Guoqing Wu, Jinhua Yu

TL;DR
This paper introduces a framework for reconstructing multi-shot videos from fMRI data. It explicitly decomposes mixed fMRI signals into shot-specific segments, synthesizes large-scale training data, and uses large language models for semantic captioning, which significantly improves reconstruction fidelity.
Contribution
The paper presents a divide-and-decode framework with shot boundary prediction and LLM-based captioning, addressing multi-shot challenges in fMRI video reconstruction for the first time.
Findings
Outperforms state-of-the-art methods in multi-shot reconstruction fidelity.
fMRI decomposition improves caption CLIP similarity by 71.8%.
Semantic captioning enables accurate visual narrative recovery.
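As a rough illustration of what a caption CLIP similarity score and a percentage improvement could look like, the sketch below computes cosine similarity between embedding vectors and a relative gain over a baseline. The vectors are toy stand-ins rather than real CLIP outputs, the function names are hypothetical, and the summary does not say whether 71.8% is a relative or an absolute change; the relative reading is assumed here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline` (assumed relative reading)."""
    return (improved - baseline) / baseline * 100.0


# Toy embeddings (assumption: real ones would come from a CLIP text encoder).
truth_caption = np.array([1.0, 0.0, 0.0])
caption_without_split = np.array([1.0, 2.0, 0.0])
caption_with_split = np.array([2.0, 1.0, 0.0])

sim_before = cosine_similarity(truth_caption, caption_without_split)
sim_after = cosine_similarity(truth_caption, caption_with_split)
print(f"before={sim_before:.3f} after={sim_after:.3f} "
      f"gain={relative_improvement(sim_before, sim_after):.1f}%")
```

In practice the scores would be averaged over all test captions before comparing the two settings.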
Abstract
Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by…
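The divide-and-decode idea can be sketched in a few lines: given per-TR boundary scores, cut the fMRI time series into shot-specific segments and decode each one separately. This is a minimal sketch and not the authors' implementation. The thresholding rule, the function names `split_by_boundaries` and `decode_segments`, and the placeholder decoder are all assumptions; in the paper the boundaries come from a learned predictor and the decoder is LLM-based.

```python
import numpy as np


def split_by_boundaries(fmri, boundary_prob, threshold=0.5):
    """Split an fMRI series of shape (T, V) into shot segments.

    boundary_prob[t] is a hypothetical score that a new shot starts at TR t.
    Index 0 is ignored because the first segment always starts there.
    """
    fmri = np.asarray(fmri)
    boundary_prob = np.asarray(boundary_prob, dtype=float)
    if fmri.shape[0] != boundary_prob.shape[0]:
        raise ValueError("fmri and boundary_prob must have the same length")
    cuts = [t for t in range(1, len(boundary_prob)) if boundary_prob[t] >= threshold]
    return np.split(fmri, cuts, axis=0)


def decode_segments(segments, decoder):
    """Apply a per-segment decoder (a stand-in for caption decoding)."""
    return [decoder(seg) for seg in segments]


# Toy data: 6 TRs, 2 voxels; boundaries predicted at TR 3 and TR 5.
fmri = np.arange(12, dtype=float).reshape(6, 2)
scores = [0.9, 0.1, 0.1, 0.8, 0.2, 0.7]
segments = split_by_boundaries(fmri, scores)

# Placeholder decoder: mean voxel pattern per segment instead of a caption.
summaries = decode_segments(segments, lambda seg: seg.mean(axis=0))
print([seg.shape[0] for seg in segments])
```

Decoding each segment on its own is what keeps signals from adjacent shots from being mixed in a single caption.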
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper explicitly addresses multi-shot fMRI-to-video reconstruction, a realistic yet previously overlooked setting. The shift from video-level to shot-level decoding is conceptually meaningful.
- Although synthetic, the attempt to create large-scale multi-shot datasets is valuable for promoting research in this direction.
- Clear ablation structure
- The “shot boundary predictor” assumes discrete scene transitions in fMRI, but fMRI has low temporal resolution (TR ≈ 2 s), making such segmentation biologically implausible. No neurophysiological evidence or subject-level analysis supports that shot transitions are detectable at this timescale.
- Improvements are small and inconsistent (e.g., SSIM changes are marginal). Metrics like “2-way” and “50-way classification” are unclear proxies for perceptual fidelity, and no statistical tests are pr
- The reported metrics are better than previous works.
- Short clip segments are a property of the experiments and the way the data was collected; I don't think they are central to the problem of fMRI video decoding. A video decoding technique can use a tool for identifying scene cuts, but to me it seems like a second-order thing, not the core of the problem. I don't think it is justified to devote a whole paper to this issue.
- Other than the scene-cut detection, I don't see how this paper differs significantly from previous works. There are minor impro
Originality: The primary strength is the high originality of proposing a shot-level paradigm. The idea of explicitly decomposing fMRI signals before decoding, moving beyond video-level alignment, is a novel and creative formulation of the problem.
Quality: The technical approach is methodical, and the experimental evaluation is comprehensive. The inclusion of ablation studies strengthens the paper by quantitatively validating the importance of key components like the shot boundary predictor.
Cla
1. The reconstruction pipeline relies exclusively on semantic captions decoded from fMRI to condition the text-to-video (T2V) generative model. While this approach effectively ensures high-level semantic consistency, it may introduce a significant limitation: the absence of direct constraints from low-level visual features present in the original fMRI signals. This could result in a loss of perceptual detail and fidelity in the reconstructed videos, as the T2V model is guided solely by textual p
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
