TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma; Quanfeng Lu; Shuai Zhong; Dahai Yu; Ping Luo; Michael K. Ng

arXiv:2601.13142·cs.CV·January 21, 2026

TL;DR

This paper introduces TVWorld, a comprehensive benchmark suite for remote-control TV navigation, revealing limitations of current models and proposing a topology-aware training framework that significantly improves TV navigation performance.

Contribution

The paper presents TVWorld benchmarks, identifies topology awareness as a key challenge, and develops TVTheseus with topology-aware training to advance TV navigation models.

Findings

01 TVTheseus achieves 68.3% success on TVWorld-N

02 Topology awareness improves long-horizon TV navigation

03 TVWorld benchmarks reveal limitations of existing agents

Abstract

Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that…
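
To make the idea of an offline, graph-based abstraction of remote-control navigation more concrete, here is a minimal sketch in Python. It is not the TVWorld implementation or data format: the state names, the key set, the toy graph, and the helpers `step` and `shortest_key_sequence` are all assumptions made for illustration. Nodes stand for UI states (a screen plus the focused element), and edges stand for remote-control key presses.

```python
from collections import deque

# Hypothetical toy graph (illustration only, not TVWorld data).
# Each node is "screen:focused_element"; each edge is a key press.
TOY_GRAPH = {
    "home:apps": {"RIGHT": "home:settings", "OK": "apps:grid"},
    "home:settings": {"LEFT": "home:apps", "OK": "settings:menu"},
    "apps:grid": {"BACK": "home:apps"},
    "settings:menu": {"DOWN": "settings:network", "BACK": "home:settings"},
    "settings:network": {"UP": "settings:menu", "BACK": "home:settings"},
}


def step(graph, state, key):
    """Apply one key press; a key with no outgoing edge leaves the state unchanged."""
    return graph[state].get(key, state)


def shortest_key_sequence(graph, start, goal):
    """Breadth-first search for a shortest key sequence from start to goal.

    Returns a list of keys, or None if the goal is unreachable.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, keys = queue.popleft()
        if state == goal:
            return keys
        for key, nxt in graph[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, keys + [key]))
    return None


if __name__ == "__main__":
    plan = shortest_key_sequence(TOY_GRAPH, "home:apps", "settings:network")
    print(plan)

    # Replay the plan offline, without any real TV attached.
    state = "home:apps"
    for key in plan:
        state = step(TOY_GRAPH, state, key)
    print(state)
```

Because the whole interface is a fixed graph, an agent's key presses can be replayed and scored without a physical device, which is one plausible reading of "reproducible and deployment-free evaluation". The breadth-first search here only provides a reference shortest path for the toy graph; it says nothing about how the paper's agents or training framework work.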

Code & Models

Datasets

hflqf88888/TVWorld (dataset, 993 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Domain Adaptation and Few-Shot Learning