Lightweight Neural App Control
Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

TL;DR
This paper presents LiMAC, a lightweight, efficient mobile app control system that uses a small Action Transformer and vision-language model to improve task accuracy on Android devices, outperforming larger models and prompt-based baselines.
Contribution
Introduction of LiMAC, a novel lightweight architecture combining an Action Transformer with a vision-language model for real-time mobile app control.
Findings
LiMAC achieves up to 19% higher action accuracy than fine-tuned VLMs.
LiMAC outperforms prompt engineering baselines by up to 42%.
The approach is computationally efficient for smartphone deployment.
Abstract
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o.
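The division of labour described in the abstract and reviews (a small AcT decides the action type and handles simple actions itself, while the VLM is called only when text must be generated) can be sketched as a simple dispatch. This is a minimal illustration, not the authors' implementation: the action-type names, the `Action` record, and the three callables `predict_type`, `predict_target`, and `generate_text` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical action-type names; the paper's actual action set is not listed here.
TEXT_ACTIONS = {"input-text", "open-app"}


@dataclass
class Action:
    action_type: str
    target: Optional[int] = None  # index of a UI element, for click-like actions
    text: Optional[str] = None    # generated text, for text-bearing actions


def choose_action(
    goal: str,
    observations: Sequence[object],
    predict_type: Callable[[str, Sequence[object]], str],
    predict_target: Callable[[str, Sequence[object]], int],
    generate_text: Callable[[str, Sequence[object], str], str],
) -> Action:
    """Route one decision step: small model first, VLM only when text is needed."""
    # Step 1: the lightweight model (standing in for AcT) predicts the action type.
    action_type = predict_type(goal, observations)

    # Step 2: text-bearing actions are delegated to the larger model (the VLM stand-in).
    if action_type in TEXT_ACTIONS:
        return Action(action_type, text=generate_text(goal, observations, action_type))

    # Step 3: click-like actions need a target element, chosen by the small model.
    if action_type == "click":
        return Action(action_type, target=predict_target(goal, observations))

    # Step 4: all other actions (assumed to need no arguments) are returned directly.
    return Action(action_type)
```

Under these assumptions, the expensive text generator is invoked only on the branch that needs it, which is the source of the resource savings the reviewers describe.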
Peer Reviews
Decision: ICLR 2025 Spotlight
- LiMAC combines a small Action Transformer (AcT) with a fine-tuned vision-language model (VLM). This hybrid approach is tailored to the computational constraints of mobile devices, achieving efficient and accurate control without relying on large, resource-intensive models. The AcT independently handles common actions, while the VLM is selectively employed for complex natural language tasks, optimizing both resource usage and response time.
- LiMAC’s modular structure supports the integration o
- Although the paper evaluates LiMAC on two datasets, both datasets are relatively specific to Android applications, potentially limiting the generalizability of results to other operating systems (e.g., iOS) or app control tasks with distinct interface designs.
- The paper does not provide an extensive scalability analysis of LiMAC’s architecture as task complexity or the number of available UI elements increases, which may impact its robustness in more complex or densely populated app environ
1. **Novel design.** The authors designed a lightweight module that predicts the type of action to take and executes simple actions directly, leaving the VLM to handle complex tasks that involve text generation. This leads to both a speed-up and better accuracy.
2. **Thorough evaluations.** I like how the authors compared using AcT and the VLM for different tasks, clearly demonstrating the performance gain from the current design, which makes sense to me.
1. **Limited Dataset and Tasks.** The authors used two relatively small datasets; the paper could benefit from larger-scale experiments and perhaps real-world user studies.
2. Due to the limited data size, the proposed model may have additional difficulty solving hard tasks (which is where mobile AI is needed the most, in my opinion). More analysis of failure modes could improve the paper.
The paper is incredibly clear and well-written. I am not an expert on Android or UI agents, but it was obvious what the contribution was, and why it was important. The small number of parameters in LiMAC make it clear how this advances the ability of users to run advanced Android control models directly on devices. The way the information is encoded is also very thoroughly described. It is somewhat difficult for me to evaluate novelty, but, assuming the related works section does not have any gl
I am not an expert in this area, so it is difficult for me to point out major weaknesses: the paper clearly states a contribution and is very self-contained. However, there are several minor things that were unclear to me, where the paper might benefit from more detail:
1. The paper mentions that positional encodings are used to represent the nesting of UI elements. How is this done, exactly?
2. It would be good to know what the "ten distinct action types" are; it seems like the same few example
Taxonomy
Topics: Neural Networks and Applications
Methods: Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
