Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

TL;DR
Ferret-UI is a specialized multimodal large language model designed for detailed understanding and interaction with mobile UI screens, outperforming existing models and GPT-4V on various UI tasks.
Contribution
The paper introduces Ferret-UI, a tailored MLLM for mobile UI understanding with novel multi-resolution encoding and extensive UI-specific training datasets.
Findings
Ferret-UI surpasses most open-source UI MLLMs in comprehension.
Ferret-UI outperforms GPT-4V on elementary UI tasks.
The model demonstrates strong reasoning and interaction capabilities.
Abstract
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to…
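The aspect-ratio-based division described in the abstract can be sketched as follows. This is only an illustration of the splitting step, not the authors' implementation. The function name `split_screen`, the `Box` tuple layout, the midpoint cut, and the handling of square screens are all assumptions made for this example. "Horizontal division" is read here as a horizontal cut that yields a top half and a bottom half.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return two sub-image boxes chosen from the screen's aspect ratio.

    Portrait screens get a horizontal cut (top and bottom halves);
    landscape screens get a vertical cut (left and right halves).
    Cutting at the midpoint and treating square screens as portrait
    are assumptions of this sketch, not details taken from the paper.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    if height >= width:
        # Portrait (or square, by assumption): cut along a horizontal line.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]

    # Landscape: cut along a vertical line.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    # Example dimensions are arbitrary and chosen only for illustration.
    print(split_screen(1170, 2532))
    print(split_screen(2532, 1170))
```

Each returned box could then be cropped from the screenshot and passed to an image encoder separately, in line with the abstract's statement that both sub-images are encoded separately.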
Taxonomy
Topics: Speech and dialogue systems
