TL;DR
MARS is a multimodal reasoning system built for the CASTLE Challenge. It draws on videos, transcripts, and auxiliary data to answer closed-form questions about four days of recorded activity.
Contribution
The paper introduces MARS, an agentic evidence-selection approach that reasons over multimodal sources in a multi-day, multi-perspective egocentric benchmark.
Findings
MARS placed second on the CASTLE Challenge leaderboard.
Integrates two primary sources (videos and transcripts) and four auxiliary sources (gaze, heartrate, photos, and thermal imagery).
Uses GPT-5.4 as a decision agent for evidence selection and reasoning.
Abstract
This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heartrate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources, videos and transcripts, and four auxiliary sources, gaze, heartrate, photos, and thermal imagery. Long videos are converted into…
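To make the source organization in the abstract concrete, here is a minimal sketch of an evidence memory keyed by the two primary and four auxiliary sources, with a simple source selector. This is an illustration only, not the authors' implementation: the names `EvidenceMemory`, `select_sources`, and `AUX_CUES` are hypothetical, and the keyword-matching rule is a stand-in for the decision agent that the paper describes.

```python
from dataclasses import dataclass, field

# Source names taken from the abstract.
PRIMARY_SOURCES = ("videos", "transcripts")
AUXILIARY_SOURCES = ("gaze", "heartrate", "photos", "thermal")

# Hypothetical cue words (an assumption for this sketch, not from the paper).
AUX_CUES = {
    "gaze": ("look", "gaze"),
    "heartrate": ("heart", "pulse"),
    "photos": ("photo", "picture"),
    "thermal": ("thermal", "temperature"),
}


@dataclass
class EvidenceMemory:
    """Holds text evidence snippets, grouped by the source they came from."""

    entries: dict = field(
        default_factory=lambda: {s: [] for s in PRIMARY_SOURCES + AUXILIARY_SOURCES}
    )

    def add(self, source: str, snippet: str) -> None:
        if source not in self.entries:
            raise KeyError(f"unknown source: {source}")
        self.entries[source].append(snippet)


def select_sources(question: str, memory: EvidenceMemory) -> list:
    """Stand-in selector: keep every non-empty primary source, and add an
    auxiliary source only if it has evidence and the question mentions a cue."""
    q = question.lower()
    selected = [s for s in PRIMARY_SOURCES if memory.entries[s]]
    for aux in AUXILIARY_SOURCES:
        if memory.entries[aux] and any(cue in q for cue in AUX_CUES[aux]):
            selected.append(aux)
    return selected
```

For example, with one transcript snippet and one heartrate snippet stored, a question that mentions heart rate would select both `transcripts` and `heartrate`, while a question with no auxiliary cue would select only `transcripts`. In a real system the keyword rule would presumably be replaced by a model-driven decision.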
