An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos, Eda B. Özyiğit

TL;DR
This paper presents an efficient training pipeline for reasoning GUI agents that combines data filtering and parameter-efficient fine-tuning, achieving high performance with significantly less data and computational resources.
Contribution
It introduces a novel data curation method and training strategy for multimodal reasoning models that outperform larger baselines on multiple benchmarks.
Findings
Filtered data improves model performance.
Lightweight training matches larger models.
Principled data curation enables efficient reasoning agents.
Abstract
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned examples, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as…
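The three curation stages named in the abstract (keep challenging cases, drop misaligned ones, select a diverse subset) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` fields, the two thresholds, and the greedy farthest-point rule used for diversity are all assumptions made for illustration, since the abstract does not specify how difficulty, alignment, or diversity are measured.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Example:
    uid: str
    difficulty: float  # hypothetical: e.g. how badly a reference model grounds this query
    alignment: float   # hypothetical: model-judged agreement between query and target region
    embedding: tuple   # hypothetical: multimodal feature vector for the instance


def curate(pool, min_difficulty, min_alignment, k):
    """Return at most k examples after the three filtering stages (illustrative only)."""
    # Stage 1: keep only challenging cases.
    hard = [ex for ex in pool if ex.difficulty >= min_difficulty]
    # Stage 2: remove examples whose query and target look misaligned.
    clean = [ex for ex in hard if ex.alignment >= min_alignment]
    if not clean or k <= 0:
        return []
    # Stage 3: greedy farthest-point selection as one possible diversity criterion.
    # Seed with the hardest example, then repeatedly add the example whose
    # nearest already-selected neighbour is farthest away in embedding space.
    seed = max(clean, key=lambda ex: ex.difficulty)
    selected = [seed]
    remaining = [ex for ex in clean if ex is not seed]
    while remaining and len(selected) < k:
        nxt = max(
            remaining,
            key=lambda ex: min(dist(ex.embedding, s.embedding) for s in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

At the scale reported in the abstract (4.8M candidates reduced to 12K), the quadratic selection loop above would need an approximate or clustered variant; the sketch only shows the order of the stages.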
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
