VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
Ziyang Meng, Yu Dai, Zezheng Gong, Shaoxiong Guo, Minglong Tang, Tongquan Wei

TL;DR
VGA is a fine-tuned vision-language model specifically designed for GUI understanding, significantly reducing hallucinations and improving accuracy in interpreting structured visual data through a new dataset and a two-stage fine-tuning process.
Contribution
The paper introduces VGA, a vision-language model fine-tuned on a new GUI-focused VQA dataset with a two-stage method, to improve GUI comprehension and reduce hallucinations.
Findings
VGA achieves state-of-the-art results in GUI understanding tasks.
The Referent Method ensures responses depend on visual content.
Two-stage fine-tuning enhances information extraction and alignment with human intent.
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have significantly improved performance in image comprehension tasks, such as formatted charts and rich-content images. Yet Graphical User Interfaces (GUIs) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often rely too heavily on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of GUI visual data and reduce hallucinations. We first construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our proposed Referent Method, which ensures the model's responses are highly dependent on visual content within the image. We then design a…
