Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You; Haotian Zhang; Eldon Schoop; Floris Weers; Amanda Swearngin; Jeffrey Nichols; Yinfei Yang; Zhe Gan

arXiv:2404.05719 · cs.CV · April 9, 2024 · 3 cites

TL;DR

Ferret-UI is a specialized multimodal large language model for detailed understanding of, and interaction with, mobile UI screens. It surpasses most open-source UI MLLMs and outperforms GPT-4V on elementary UI tasks.

Contribution

The paper introduces Ferret-UI, a tailored MLLM for mobile UI understanding with novel multi-resolution encoding and extensive UI-specific training datasets.

Findings

1. Ferret-UI surpasses most open-source UI MLLMs in comprehension.
2. Ferret-UI outperforms GPT-4V on elementary UI tasks.
3. The model demonstrates strong reasoning and interaction capabilities.

Abstract

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to…
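The sub-image division described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_screen`, the equal-halves split point, and the treatment of a square screen as portrait are all assumptions made for illustration. The abstract only states that portrait screens are divided horizontally and landscape screens vertically.

```python
from typing import List, Tuple

# A crop box as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return two crop boxes covering a screen of the given size.

    Hypothetical helper: portrait screens get a horizontal cut (top and
    bottom halves), landscape screens a vertical cut (left and right
    halves). Cutting at the midpoint is an assumption, as is treating
    a square screen as portrait.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    if height >= width:
        # Portrait: horizontal division into top and bottom sub-images.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]

    # Landscape: vertical division into left and right sub-images.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    print(split_screen(1170, 2532))  # portrait phone screenshot
    print(split_screen(2532, 1170))  # same screen rotated to landscape
```

Each box could then be passed to an image library's crop routine so that the two sub-images are encoded separately, as the abstract describes. How they are resized or combined with the full image afterwards is not covered by the text shown here.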

Taxonomy

Topics: Speech and dialogue systems