Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
Alexandra Yakovleva, Henrik Pärssinen, Harri Valpola, Juho Kannala, Alexander Ilin

TL;DR
This paper fine-tunes Qwen2.5-VL-32B for more reliable element localization and action verification, raising its success rate on web tasks from 86% to 94%.
Contribution
It introduces a two-stage fine-tuning pipeline that improves Qwen2.5-VL's accuracy and reliability in web-based control tasks, addressing key challenges in localization and action success verification.
Findings
Success rate increased from 86% to 94% on challenging web tasks.
Fine-tuning improves localization of target elements and action outcome verification.
The approach enhances the model's robustness in web interaction scenarios.
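To put the reported success rates in perspective, the short sketch below restates them as an absolute gain and as a relative reduction in failures. It assumes both figures (86% and 94%) were measured on the same task set; the function name `summarize_gain` and its output keys are illustrative and do not come from the paper.

```python
def summarize_gain(before_pct: float, after_pct: float) -> dict:
    """Compare two success rates given in percent (hypothetical helper)."""
    fail_before = 100.0 - before_pct
    fail_after = 100.0 - after_pct
    if fail_before <= 0:
        raise ValueError("baseline has no failures to reduce")
    return {
        # Gain in percentage points.
        "absolute_gain_pp": after_pct - before_pct,
        "failure_before_pct": fail_before,
        "failure_after_pct": fail_after,
        # Share of the baseline failures that were eliminated.
        "relative_failure_reduction": (fail_before - fail_after) / fail_before,
    }


# Success rates as reported in the Findings above.
stats = summarize_gain(86.0, 94.0)
print(stats)
```

Read this way, the 8-point gain corresponds to the failure rate dropping from 14% to 6%, that is, roughly 57% of the baseline failures no longer occur.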
Abstract
Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i) inaccurate localization of target elements, the cursor, and their relative positions, (ii) sensitivity to instruction phrasing, and (iii) an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training…
