VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang

TL;DR
VisualChef is a novel method that generates contextual visual aids for cooking by using mask inpainting to produce images of actions and outcomes, maintaining environmental consistency without relying on detailed textual annotations.
Contribution
VisualChef introduces a mask-based visual grounding approach for generating cooking visual aids, which simplifies visual-textual alignment and enables targeted modifications based on action relevance.
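To make the idea of mask-based targeted modification concrete, here is a minimal sketch of why a mask keeps the environment consistent: pixels outside the mask are copied from the initial frame, and only pixels inside it come from generated content. This is not VisualChef's pipeline, which relies on a generative inpainting model. The function name `composite_with_mask`, the toy nested-list image format, and the sample values are all assumptions made for illustration.

```python
from typing import List

# Hypothetical toy formats: a grayscale image is a list of rows of intensities,
# and a mask uses 1 for "the action may change this pixel" and 0 for "keep it".
Image = List[List[int]]
Mask = List[List[int]]


def composite_with_mask(initial: Image, generated: Image, mask: Mask) -> Image:
    """Keep the initial frame outside the mask; take generated pixels inside it."""
    if not (len(initial) == len(generated) == len(mask)):
        raise ValueError("initial, generated, and mask must have the same height")
    result: Image = []
    for row_init, row_gen, row_mask in zip(initial, generated, mask):
        if not (len(row_init) == len(row_gen) == len(row_mask)):
            raise ValueError("rows must have the same width")
        # Pixel-wise selection: generated value where mask is set, else original.
        result.append([g if m else i for i, g, m in zip(row_init, row_gen, row_mask)])
    return result


# Illustrative data only: a flat "scene" and a flat "generated" image.
initial = [[10, 10, 10], [10, 10, 10]]
generated = [[99, 99, 99], [99, 99, 99]]
mask = [[0, 1, 0], [0, 1, 1]]

print(composite_with_mask(initial, generated, mask))
# prints [[10, 99, 10], [10, 99, 99]]
```

In this toy setting the unmasked pixels are preserved exactly, which is the property that lets a mask restrict changes to the action-relevant region of the frame.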
Findings
Outperforms state-of-the-art methods quantitatively
Provides high-quality visual aids in cooking scenarios
Works effectively across multiple egocentric video datasets
Abstract
Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef…
