An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Georgios Pantazopoulos, Eda B. Özyiğit

arXiv:2511.08172·cs.AI·November 17, 2025

TL;DR

This paper presents an efficient training pipeline for reasoning GUI agents that combines data filtering with parameter-efficient fine-tuning, matching or surpassing larger baselines with significantly less data and fewer computational resources.

Contribution

It introduces a novel data curation method and training strategy for multimodal reasoning models; the resulting models match or surpass larger baselines on multiple benchmarks.

Findings

01

Filtered data improves model performance.

02

Lightweight training matches larger models.

03

Principled data curation enables efficient reasoning agents.

Abstract

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned examples, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as…
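The three curation stages named in the abstract (identify challenging cases, remove misaligned examples, select a diverse subset) can be illustrated with a minimal sketch. The abstract does not say how each stage is implemented, so everything concrete here is an assumption made for illustration: the `Example` fields, the IoU-based difficulty test, the `align_score` threshold, and the greedy farthest-point selection are hypothetical stand-ins, not the authors' method.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Example:
    query: str
    box: Box                      # annotated target region
    pred_box: Box                 # hypothetical prediction from a reference model
    align_score: float            # hypothetical query/region alignment score in [0, 1]
    embedding: Tuple[float, ...]  # hypothetical feature vector for the instance


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_challenging(ex: Example, iou_threshold: float = 0.5) -> bool:
    # Assumption: an example is "challenging" if the reference model misses the box.
    return iou(ex.box, ex.pred_box) < iou_threshold


def is_aligned(ex: Example, min_align: float = 0.5) -> bool:
    # Assumption: misalignment is detected by thresholding a model-provided score.
    return ex.align_score >= min_align


def select_diverse(examples: List[Example], k: int) -> List[Example]:
    """Greedy farthest-point selection in embedding space (an assumed diversity criterion)."""
    if not examples or k <= 0:
        return []
    selected = [examples[0]]
    remaining = list(examples[1:])
    while remaining and len(selected) < k:
        # Pick the candidate whose nearest selected neighbour is farthest away.
        best = max(
            remaining,
            key=lambda c: min(math.dist(c.embedding, s.embedding) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def curate(examples: List[Example], k: int) -> List[Example]:
    hard = [ex for ex in examples if is_challenging(ex)]
    clean = [ex for ex in hard if is_aligned(ex)]
    return select_diverse(clean, k)
```

In this sketch the stages run in the order the abstract lists them, so the diversity step only ever sees examples that are both hard for the reference model and judged well aligned.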

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning