TL;DR
PlanViz is a benchmark for evaluating the image generation and editing capabilities of unified multimodal models (UMMs) on computer-use planning tasks such as route planning, work diagramming, and web&UI displaying, with an emphasis on spatial reasoning and procedural understanding.
Contribution
The paper presents PlanViz, a new benchmark with a task-adaptive scoring method, to assess how well unified multimodal models (UMMs) perform on computer-use planning tasks involving image generation and editing.
Findings
Experiments reveal current limitations of UMMs in planning tasks.
The benchmark highlights opportunities for improving spatial reasoning in models.
PlanScore effectively measures correctness, visual quality, and efficiency of generated images.
Abstract
Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that frequently arise in daily life and require planning. Specifically, three representative sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address challenges in ensuring data quality by curating…
