See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm
Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang, Kun Shao, Jun Wang

TL;DR
See-Control is a multimodal agent framework that operates a smartphone through a low-DoF robotic arm instead of the Android Debug Bridge (ADB), so it is not limited to Android devices and can be applied in physical-world settings.
Contribution
The paper presents a novel platform-agnostic framework with a new benchmark, dataset, and an embodied agent for physical smartphone interaction using a robotic arm.
Findings
Defines the ESO benchmark, comprising 155 tasks with corresponding evaluation metrics.
Operates without ADB or system back-end access.
Provides a richly annotated dataset of operation episodes for future research in embodied AI.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging…
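The abstract does not specify how the agent's output is turned into arm motion. As a minimal sketch of the general idea only, assume the agent emits a tap at normalized screen coordinates and that the phone's screen is axis-aligned with the arm's XY frame. The names `ScreenCalibration`, `screen_to_arm`, and `tap_commands`, the command tuples, and the default heights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenCalibration:
    """Hypothetical calibration of the phone screen in the arm's XY frame (mm).

    Assumes the screen edges are parallel to the arm's X and Y axes.
    """
    origin_x_mm: float  # arm-frame X of the screen's top-left corner
    origin_y_mm: float  # arm-frame Y of the screen's top-left corner
    width_mm: float     # physical screen width
    height_mm: float    # physical screen height


def screen_to_arm(cal: ScreenCalibration, u: float, v: float) -> tuple[float, float]:
    """Map normalized screen coordinates (u, v) in [0, 1] to arm XY in mm."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    return (cal.origin_x_mm + u * cal.width_mm,
            cal.origin_y_mm + v * cal.height_mm)


def tap_commands(cal: ScreenCalibration, u: float, v: float,
                 hover_z_mm: float = 10.0, touch_z_mm: float = 0.0):
    """Return an illustrative hover, touch, retract sequence for one tap."""
    x, y = screen_to_arm(cal, u, v)
    return [
        ("move", x, y, hover_z_mm),  # position above the target
        ("move", x, y, touch_z_mm),  # lower until the tip touches the screen
        ("move", x, y, hover_z_mm),  # lift off again
    ]


if __name__ == "__main__":
    cal = ScreenCalibration(origin_x_mm=100.0, origin_y_mm=50.0,
                            width_mm=70.0, height_mm=150.0)
    for command in tap_commands(cal, 0.5, 0.5):
        print(command)
```

Because the mapping depends only on where the screen sits physically, nothing in it refers to the phone's operating system, which is the sense in which a physical-interaction approach can be platform-agnostic. A real setup would also need to handle rotation and perspective between the camera, the screen, and the arm.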
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems
