VisualChef: Generating Visual Aids in Cooking via Mask Inpainting

Oleh Kuzyk; Zuoyue Li; Marc Pollefeys; Xi Wang

arXiv:2506.18569·cs.CV·October 7, 2025



TL;DR

VisualChef is a novel method that generates contextual visual aids for cooking. It uses mask inpainting to produce images of an action being performed and of its outcome, keeping the surrounding environment consistent without relying on detailed textual annotations.

Contribution

It introduces a mask-based visual grounding approach for generating cooking visual aids, which simplifies visual-textual alignment and enables targeted modifications of the objects relevant to the action.
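The summary does not say how masks are produced or combined, so the following is only a rough illustration of why mask-based grounding keeps the scene intact. It is a minimal Python sketch that assumes per-object binary masks already exist and that some generative inpainting model (not shown; `inpainted` is a placeholder) has produced a full-frame image. The names `union_masks` and `composite` and the object labels are hypothetical and are not VisualChef's code. The masks of the action-relevant objects are merged into a single edit mask. Pixels outside that mask are then copied verbatim from the initial frame, so only the targeted regions can change.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: a grayscale image as rows of floats in [0, 1].
Grid = List[List[float]]
Mask = List[List[bool]]


def union_masks(object_masks: Dict[str, Mask], relevant: Sequence[str]) -> Mask:
    """Merge the masks of the objects judged relevant to the action into one edit mask."""
    if not relevant:
        raise ValueError("need at least one action-relevant object")
    first = object_masks[relevant[0]]
    h, w = len(first), len(first[0])
    merged = [[False] * w for _ in range(h)]
    for name in relevant:
        m = object_masks[name]
        for y in range(h):
            for x in range(w):
                merged[y][x] = merged[y][x] or m[y][x]
    return merged


def composite(original: Grid, generated: Grid, mask: Mask) -> Grid:
    """Keep original pixels outside the mask; take generated pixels inside it."""
    return [
        [g if m else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]


# Toy usage: only the "onion" and "knife" regions may change; the board stays as-is.
frame = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inpainted = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9]]  # placeholder for a model's output
masks = {
    "onion": [[False, True, False], [False, False, False]],
    "knife": [[False, False, False], [False, False, True]],
    "cutting_board": [[True, False, False], [False, False, False]],
}
edit_mask = union_masks(masks, ["onion", "knife"])
print(composite(frame, inpainted, edit_mask))  # [[0.1, 0.9, 0.3], [0.4, 0.5, 0.9]]
```

Hard compositing like this guarantees that the unmasked environment is unchanged. Real inpainting models typically also condition on the unmasked context when filling the masked region, and this sketch does not model that step.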

Findings

01. Outperforms state-of-the-art methods quantitatively

02. Provides high-quality visual aids in cooking scenarios

03. Works effectively across multiple egocentric video datasets

Abstract

Cooking requires not only following instructions but also understanding, executing, and monitoring each step, a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef…
