EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan, Chengzhi Wu, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

TL;DR
EgoExoMem introduces a new benchmark for cross-view memory reasoning using synchronized egocentric and exocentric videos, highlighting the challenges and potential of dual-view cues in embodied intelligence.
Contribution
It presents the first benchmark for cross-view memory reasoning and proposes E^2-Select, a novel frame selection method for synchronized videos.
Findings
Existing models perform poorly on the benchmark: the best one reaches only 55.3%.
E^2-Select outperforms other frame-selection and memory baselines, achieving 58.2%.
Experiments show that ego and exo views provide complementary cues, but also reveal view-preference conflicts.
Abstract
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E^2-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only 55.3%. E^2-Select achieves…
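The abstract names two ingredients of E^2-Select: relevance-based budget allocation and per-view k-DPP sampling. The sketch below illustrates that general idea only; it is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`allocate_budget`, `greedy_kdpp`, `select_frames`), the proportional budget rule, the kernel L[i][j] = r_i * <f_i, f_j> * r_j over unit-normalized frame features, and the use of greedy MAP inference in place of true k-DPP sampling. Cross-view temporal consistency is not modeled here.

```python
import math


def allocate_budget(ego_rel, exo_rel, total):
    """Split `total` frames between views in proportion to summed relevance.

    Hypothetical rule; assumes total <= len(ego_rel) + len(exo_rel).
    """
    ego_mass, exo_mass = sum(ego_rel), sum(exo_rel)
    if ego_mass + exo_mass == 0:
        k_ego = total // 2
    else:
        k_ego = round(total * ego_mass / (ego_mass + exo_mass))
    # Never request more frames from a view than that view contains.
    k_ego = max(total - len(exo_rel), min(k_ego, len(ego_rel)))
    return k_ego, total - k_ego


def greedy_kdpp(features, relevance, k):
    """Greedy MAP approximation of a k-DPP (a stand-in for exact sampling).

    Picks up to k indices that trade off relevance against redundancy,
    using an incremental Cholesky update of the kernel determinant.
    """
    n = len(features)

    def kernel(i, j):
        dot = sum(a * b for a, b in zip(features[i], features[j]))
        return relevance[i] * dot * relevance[j]

    chol = [[] for _ in range(n)]            # partial Cholesky rows
    gain = [kernel(i, i) for i in range(n)]  # marginal determinant gain
    selected = []
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda i: gain[i])
        if gain[best] <= 1e-12:              # remaining frames are redundant
            break
        selected.append(best)
        root = math.sqrt(gain[best])
        for i in candidates:
            if i == best:
                continue
            proj = sum(a * b for a, b in zip(chol[best], chol[i]))
            e = (kernel(best, i) - proj) / root
            chol[i].append(e)
            gain[i] -= e * e
    return sorted(selected)


def select_frames(ego_feats, ego_rel, exo_feats, exo_rel, total):
    """Allocate the frame budget across views, then select within each view."""
    k_ego, k_exo = allocate_budget(ego_rel, exo_rel, total)
    return (greedy_kdpp(ego_feats, ego_rel, k_ego),
            greedy_kdpp(exo_feats, exo_rel, k_exo))
```

In this sketch, a view whose frames score higher against the question receives more of the budget, and within each view a frame that nearly duplicates an already selected one contributes almost no determinant gain, so it is skipped in favor of a less relevant but more distinct frame.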
