Beyond Syntax: Action Semantics Learning for App Agents
Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

TL;DR
This paper introduces Action Semantics Learning (ASL), a new framework for training App agents that focuses on understanding the semantics of actions rather than exact syntax, enhancing robustness and generalization.
Contribution
The paper proposes a semantics-based learning framework with a novel Semantic Estimator, which improves the out-of-distribution robustness and performance of App agents over traditional syntax-based methods.
Findings
ASL improves accuracy of App agents across multiple benchmarks.
ASL demonstrates superior robustness to out-of-distribution actions.
Theoretical analysis confirms ASL's enhanced OOD robustness.
Abstract
Recent progress in Large Language Models (LLMs) has enabled App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions built on proprietary LLM APIs show promising ability, they incur heavy compute costs and depend on external APIs. Fine-tuning smaller open-source LLMs addresses these limitations. However, current supervised fine-tuning methods follow a syntax learning paradigm that forces agents to reproduce the ground-truth action strings exactly, which leaves them vulnerable to out-of-distribution (OOD) inputs. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework in which the learning objective is to capture the semantics of the ground-truth actions. Specifically, inspired by programming language theory, we define the action semantics for App agents as the state transition…
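The contrast between syntax learning and semantics learning can be illustrated with a toy sketch. It is not the paper's method or its Semantic Estimator: the screen layout, the `transition` function, and the dictionary action format are all hypothetical, chosen only to show how two differently written actions can lead to the same state transition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    """A rectangular UI element on a hypothetical toy screen."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


# Hypothetical screen layout (an assumption for illustration only).
SCREEN = (
    Button("search", 0, 0, 100, 50),
    Button("settings", 0, 60, 100, 110),
)


def transition(state: str, action: dict) -> str:
    """Toy environment: return the UI state reached after applying an action."""
    if action["type"] == "click":
        for button in SCREEN:
            if button.contains(action["x"], action["y"]):
                return f"{button.name}_page"
    return state  # the action had no effect


def syntax_match(pred: dict, truth: dict) -> bool:
    """Syntax view: the prediction must reproduce the ground truth exactly."""
    return pred == truth


def semantic_match(state: str, pred: dict, truth: dict) -> bool:
    """Semantics view: the prediction must cause the same state transition."""
    return transition(state, pred) == transition(state, truth)


truth = {"type": "click", "x": 50, "y": 25}
pred_same_effect = {"type": "click", "x": 60, "y": 30}  # still on "search"
pred_other = {"type": "click", "x": 50, "y": 80}        # lands on "settings"

print(syntax_match(pred_same_effect, truth))
print(semantic_match("home", pred_same_effect, truth))
print(semantic_match("home", pred_other, truth))
```

Under these assumptions, a click a few pixels away from the ground-truth coordinates fails the exact-match check but passes the transition-based check, while a click on a different button fails both.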
