ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

Qi'ao Xu; Tianwen Qian; Yuqian Fu; Kailing Li; Yang Jiao; Jiacheng Zhang; Xiaoling Wang; Liang He

arXiv:2512.03666·cs.CV·April 7, 2026

TL;DR

ToG-Bench is a new benchmark for task-oriented spatio-temporal grounding in egocentric videos, emphasizing goal-directed object localization and reasoning for embodied AI.

Contribution

The paper introduces the first task-oriented STVG benchmark, covering explicit-implicit and multi-object grounding, together with evaluation metrics and a systematic benchmark of state-of-the-art models.

Findings

01. Significant performance gaps in current models for task-oriented grounding.

02. Challenges in bridging perception and interaction in embodied scenarios.

03. Intrinsic difficulty of explicit-implicit and multi-object grounding tasks.

Abstract

A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by…
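To make the STVG formulation concrete, here is a minimal Python sketch of two scores commonly used in the STVG literature: temporal IoU over the predicted and ground-truth frame sets, and video IoU (vIoU), which sums per-frame box IoU over the shared frames and divides by the size of the frame union. This excerpt does not say which metrics ToG-Bench itself defines, so treat this as an illustration of the task, not the paper's evaluation protocol. The function names and the frame-index-to-box dictionary layout are assumptions made for this example.

```python
from typing import Dict, Tuple

# Assumed layout: a box is (x1, y1, x2, y2); a "tube" maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(pred: Tube, gt: Tube) -> float:
    """Shared frames divided by the union of predicted and ground-truth frames."""
    union = set(pred) | set(gt)
    shared = set(pred) & set(gt)
    return len(shared) / len(union) if union else 0.0


def video_iou(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame box IoU over shared frames, divided by the frame union size."""
    union = set(pred) | set(gt)
    if not union:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[f], gt[f]) for f in shared) / len(union)


if __name__ == "__main__":
    # Hypothetical tubes: the prediction covers frames 1-2, the ground truth frames 2-3.
    pred = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
    gt = {2: (0, 0, 10, 10), 3: (0, 0, 10, 10)}
    print(temporal_iou(pred, gt), video_iou(pred, gt))
```

For multi-object grounding, one plausible extension is to compute these scores per target object and average them, but that aggregation is likewise an assumption and not something stated here.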

Code & Models

Repositories

qaxuDev/ToG-Bench (GitHub)
