StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Xuanyue Zhong; Yuqiang Xie; Guanqun Bi; Jiangping Yang; Guibin Chen

arXiv:2604.23198 · cs.AI · April 28, 2026

TL;DR

StoryTR introduces a narrative-centric video moment retrieval benchmark that requires Theory of Mind reasoning, and shows that understanding implicit mental states matters for retrieval performance.

Contribution

The paper presents the first ToM-based video retrieval benchmark and a training pipeline that improves models' narrative reasoning, with gains that exceed those obtained from scale alone.

Findings

01  The Shorts-Moment model gains +15.1% IoU over baselines.

02  Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR.

03  ToM reasoning is crucial for narrative video understanding.
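
The IoU figures in the findings measure how well a predicted time span overlaps the annotated moment. The page does not spell out the exact metric, so the following is only a minimal sketch that assumes the standard temporal IoU used in video moment retrieval, with segments given as (start, end) in seconds. The function names `temporal_iou` and `average_iou` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Assumed definition: overlap length divided by the length of the union.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def average_iou(preds, gts):
    """Mean temporal IoU over paired predictions and ground truths."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # A prediction of 0-10 s against a ground truth of 5-15 s
    # overlaps for 5 s out of a 15 s union.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Under this assumed definition, an average IoU of 0.53 would mean that predicted and annotated moments overlap by roughly half of their combined span on average.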

Abstract

Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see what is happening but fail to reason why it matters. This semantic gap stems from the lack of Theory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce StoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character "smiling" may actually be…
