See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm

Haoyu Zhao; Weizhong Ding; Yuhao Yang; Zheng Tian; Linyi Yang; Kun Shao; Jun Wang

arXiv:2512.08629·cs.AI·December 10, 2025

TL;DR

See-Control is a multimodal agent framework that operates a smartphone through direct physical interaction with a low-DoF robotic arm instead of the Android Debug Bridge (ADB), so it is not limited to Android devices.

Contribution

The paper presents a platform-agnostic framework comprising a new benchmark, an annotated dataset, and an MLLM-based embodied agent for physical smartphone interaction using a robotic arm.

Findings

01

Introduces the ESO benchmark, with 155 tasks and corresponding evaluation metrics.

02

Operates without ADB or system back-end access.

03

Provides a richly annotated dataset of operation episodes for future research in embodied AI.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems